HBase summary (5) - HBase basics and what scenarios HBase is suitable for

What database should we use when we are not sure about the fields of the data structure, or when the data is so disorganized that it is hard to extract it according to a single concept? If we use a traditional database, we have to reserve redundant fields: if 10 are not enough, then 20, and this seriously hurts quality. When facing large databases with PB-level data, the waste is even more serious. So which database should we use? HBase is one of the good choices. That said, we still have the following questions about HBase:

 

1. What does Column Family stand for?
2. HBase locates a piece of data through row and column, and the value of this data may have multiple versions. Why can multiple versions exist?
3. Which version is returned when querying?
4. What are their storage types?
5. What is the type of tableName?
6. What type are RowKey and ColumnName?
7. What type is Timestamp?
8. What is the type of value?

Read the following with the above questions in mind:

 

Introduction

More and more projects in the team are using HBase. Business developers usually do not need to build and maintain an HBase cluster from scratch, nor to understand its architecture in depth (the HBase cluster maintenance team is responsible for that); what they urgently need is a quick grasp of the fundamental techniques required to solve business problems. Recently, during the rotation on the XX project, I tried to look at HBase from the perspective of a business developer and recorded some of the process, hoping to help business developers quickly understand HBase and master the related techniques for their work.

In my view, a business developer or tester who is new to HBase needs to master at least the following points:

1. Understand HTable in depth and know how to design a high-performance HTable for the business.
2. Master the interaction with HBase; whatever the task, it always comes down to adding, deleting, modifying and querying data, through both the HBase Shell and the Java API.
3. Master how to analyze the data in HBase with MapReduce; the data in HBase always needs to be analyzed, and MapReduce is one way to do it.
4. Master how to test HBase MapReduce jobs; you cannot just write code without checking its correctness, and debugging is necessary, so we will also look at how to debug and unit test locally.

This series will focus on the points above, and you can read selectively, for example only the parts about HBase's MapReduce and how to test it.

Starting from an example: everyone is familiar with traditional relational databases, so we will use a simple example to compare the respective solutions, advantages and disadvantages of an RDBMS and HBase. Taking blog posts as an example, the RDBMS table design is as follows:
 
To make this easier to understand, here are some example data records:

For the example above, we can design the HBase table in the following way:

Again for ease of understanding, we show some example data and mark some key concepts in red; these will be explained later.
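Since the original design figures are not reproduced here, below is a minimal sketch of how such a table could be created with the HBase Java client. The table name blog and the column families article and author come from the example; the connection setup and the 1.x-style API (HTableDescriptor/HColumnDescriptor, deprecated in newer releases) are assumptions for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateBlogTable {
    public static void main(String[] args) throws IOException {
        // Cluster address is read from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // One table "blog" with two column families: article and author.
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("blog"));
            desc.addFamily(new HColumnDescriptor("article"));
            desc.addFamily(new HColumnDescriptor("author"));
            admin.createTable(desc);
        }
    }
}
```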

Some basic concepts of HTable

Row key


The Row key is the primary key of a row. HBase does not support conditional queries or ORDER BY; records can only be read by Row key (or a Row key range) or by a full table scan. The Row key therefore needs to be designed around the business so as to exploit its sorted storage (a Table is sorted lexicographically by Row key, e.g. 1, 10, 100, 11, 2) and improve performance.
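For instance, reading by Row key range looks roughly like the following sketch (HBase 1.x client API; the conn variable is a Connection as in the earlier creation sketch, and the key values are placeholders):

```java
// Imports assumed: org.apache.hadoop.hbase.TableName, org.apache.hadoop.hbase.client.*,
// org.apache.hadoop.hbase.util.Bytes
static void scanRowKeyRange(Connection conn) throws IOException {
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("1"));   // inclusive; keys are compared as bytes, lexicographically
    scan.setStopRow(Bytes.toBytes("2"));    // exclusive
    try (Table table = conn.getTable(TableName.valueOf("blog"));
         ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
            System.out.println(Bytes.toString(row.getRow()));
        }
    }
}
```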

Column Family

Column families are declared when the table is created; each Column Family is a storage unit. In the example above we designed an HBase table named blog with two column families: article and author.

Column

Every column in HBase belongs to a column family and is prefixed with the column family name; for example, the columns article:title and article:content belong to the article column family, while author:name and author:nickname belong to the author column family.
Columns do not have to be defined when the table is created and can be added dynamically. Columns of the same Column Family are clustered in one storage unit and sorted by Column key, so columns with similar I/O characteristics should be placed in the same Column Family to improve performance. Note also that columns can be added and removed at will, which is a big difference from traditional databases and is why HBase suits unstructured data.
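A minimal sketch of such a dynamic write, assuming the blog table above and a Connection conn as in the creation sketch (the author:email column and its value are purely illustrative):

```java
// Imports assumed: org.apache.hadoop.hbase.TableName, org.apache.hadoop.hbase.client.*,
// org.apache.hadoop.hbase.util.Bytes
static void addEmailColumn(Connection conn) throws IOException {
    try (Table table = conn.getTable(TableName.valueOf("blog"))) {
        Put put = new Put(Bytes.toBytes("1"));   // row key
        // author:email was never declared anywhere; the column comes into existence on first write.
        put.addColumn(Bytes.toBytes("author"), Bytes.toBytes("email"),
                Bytes.toBytes("yedu@example.com"));
        table.put(put);
    }
}
```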

Timestamp

HBase locates a piece of data through row and column. That piece of data may have multiple versions, and the different versions are sorted in reverse chronological order, i.e. the newest value comes first; a query returns the latest version by default. In the example above, the author:nickname value of the row with Row key=1 has two versions: "一叶渡江" at timestamp 1317180070811 and "yedu" at timestamp 1317180718830 (in business terms, the nickname was changed to yedu at some point, but the old value still exists). The Timestamp defaults to the current system time in milliseconds, and it can also be specified explicitly when writing data.
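To read more than the latest version, the number of versions can be requested explicitly on the Get; a sketch with the 1.x API (setMaxVersions is replaced by readVersions in HBase 2.x), again assuming a Connection conn:

```java
// Imports assumed: org.apache.hadoop.hbase.Cell, org.apache.hadoop.hbase.CellUtil,
// org.apache.hadoop.hbase.TableName, org.apache.hadoop.hbase.client.*, org.apache.hadoop.hbase.util.Bytes
static void printNicknameVersions(Connection conn) throws IOException {
    Get get = new Get(Bytes.toBytes("1"));
    get.addColumn(Bytes.toBytes("author"), Bytes.toBytes("nickname"));
    get.setMaxVersions(2);                       // up to 2 versions instead of only the latest
    try (Table table = conn.getTable(TableName.valueOf("blog"))) {
        Result result = table.get(get);
        for (Cell cell : result.rawCells()) {    // newest version first
            System.out.println(cell.getTimestamp() + " => "
                    + Bytes.toString(CellUtil.cloneValue(cell)));
        }
    }
}
```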
Value

Each value is uniquely indexed by four keys: tableName + RowKey + ColumnKey + Timestamp => value. For example, {tableName='blog', RowKey='1', ColumnName='author:nickname', Timestamp='1317180718830'} indexes the unique value "yedu".

Storage types

TableName is a string.
RowKey and ColumnName are binary values (Java type byte[]).
Timestamp is a 64-bit integer (Java type long).
Value is a byte array (Java type byte[]).
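In client code this means converting everything to and from bytes, typically via the Bytes utility class; a small sketch of the conversions implied by the example above:

```java
import org.apache.hadoop.hbase.util.Bytes;

public class StorageTypesDemo {
    public static void main(String[] args) {
        // Everything except the table name travels as raw bytes.
        byte[] rowKey    = Bytes.toBytes("1");          // RowKey: byte[]
        byte[] family    = Bytes.toBytes("author");     // ColumnName (family part): byte[]
        byte[] qualifier = Bytes.toBytes("nickname");   // ColumnName (qualifier part): byte[]
        long timestamp   = 1317180718830L;              // Timestamp: long
        byte[] value     = Bytes.toBytes("yedu");       // Value: byte[]

        // Converting back when reading a result.
        System.out.println(Bytes.toString(family) + ":" + Bytes.toString(qualifier)
                + " @ " + timestamp + " => " + Bytes.toString(value) + " (row "
                + Bytes.toString(rowKey) + ")");
    }
}
```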


Storage structure

The storage structure of an HTable can be understood simply as follows:

An HTable is automatically sorted by Row key; each Row contains any number of Columns, the Columns are automatically sorted by Column key, and each Column contains any number of Values (versions). Understanding this storage structure helps when iterating over query results.
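The nesting is visible when iterating over query results: a ResultScanner yields rows in Row key order, and each Result holds that row's cells sorted by column and, within a column, newest version first. A sketch (1.x API, assuming the blog table and a Connection conn as before):

```java
// Imports assumed: org.apache.hadoop.hbase.Cell, org.apache.hadoop.hbase.CellUtil,
// org.apache.hadoop.hbase.TableName, org.apache.hadoop.hbase.client.*, org.apache.hadoop.hbase.util.Bytes
static void dumpBlogTable(Connection conn) throws IOException {
    Scan scan = new Scan();
    scan.setMaxVersions();                            // include all stored versions, not just the latest
    try (Table table = conn.getTable(TableName.valueOf("blog"));
         ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {                  // rows arrive in Row key order
            String rowKey = Bytes.toString(row.getRow());
            for (Cell cell : row.rawCells()) {        // cells sorted by column, then newest first
                String column = Bytes.toString(CellUtil.cloneFamily(cell)) + ":"
                        + Bytes.toString(CellUtil.cloneQualifier(cell));
                System.out.println(rowKey + " / " + column + " @ "
                        + cell.getTimestamp() + " => "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```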


In other words, what scenarios is HBase suitable for?

Semi-structured or unstructured data

Data whose structure fields cannot be fixed in advance, or data that is too disorganized to extract according to a single concept, is well suited to HBase. Taking the example above, when the business later needs to store the author's email, phone and address, an RDBMS needs to be taken down for schema maintenance, whereas HBase supports adding columns dynamically.

Very sparse records

The number of columns in an RDBMS row is fixed, and columns that are null waste storage space. As mentioned above, HBase does not store null columns at all, which both saves space and improves read performance.


Multi-version data

As mentioned above, the value located by a Row key and a Column key can have any number of versions, so HBase is very convenient for data whose change history needs to be kept. For example, the author's address in the example above may change; the business usually only needs the latest value, but sometimes the historical values may also need to be queried.
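How many versions are actually retained is a property of the column family (recent HBase releases keep only 1 by default), so if change history matters the family should be configured accordingly. A sketch in the same 1.x style as the earlier creation example; the number 10 is just an illustration:

```java
// Imports assumed: java.io.IOException, org.apache.hadoop.hbase.HColumnDescriptor,
// org.apache.hadoop.hbase.HTableDescriptor, org.apache.hadoop.hbase.TableName,
// org.apache.hadoop.hbase.client.Admin
static void createBlogTableWithHistory(Admin admin) throws IOException {
    HColumnDescriptor author = new HColumnDescriptor("author");
    author.setMaxVersions(10);                       // keep up to 10 historical values per cell
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("blog"));
    desc.addFamily(author);
    desc.addFamily(new HColumnDescriptor("article"));
    admin.createTable(desc);
}
```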


Very large data volumes

When the amount of data grows to the point where a single RDBMS instance can no longer cope, a read-write separation strategy appears: one master handles writes and multiple slaves handle reads, which multiplies the server cost. As the load keeps increasing the master can no longer keep up, so the database has to be split and loosely related data deployed separately; some join queries can no longer be used, and a middle layer is needed. As the data grows further still, the records of a single table become so numerous that queries become very slow, so the table has to be partitioned, for example split into multiple tables by ID modulo to reduce the number of records per table. Anyone who has been through this knows how frustrating the process is. With HBase it is simple: just add machines, and HBase will automatically split and scale horizontally, while its seamless integration with Hadoop guarantees data reliability (HDFS) and high-performance analysis of massive data (MapReduce).
