I have recently been building up my knowledge of big data and organizing my own study notes. This post briefly covers HBase's data model and table design.
First, HBase is a distributed, column-oriented, open source database. The technology is derived from the Google paper "Bigtable: A Distributed Storage System for Structured Data" by Fay Chang et al. Just as Bigtable builds on the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase began as a sub-project of Apache Hadoop and is now a top-level Apache project. Unlike a general relational database, HBase is a database suited to storing unstructured data. Another difference is that HBase organizes data by column rather than by row.
HBase is a NoSQL database built for massive data sets; it can support large tables on the order of a billion rows and millions of columns. Let's understand HBase table design by comparing it with relational databases.
Table structure of relational databases
To better understand the idea behind HBase tables, let's first review how tables are handled in a relational database.
For example, take a user table user_info with the fields id, name, and tel. The table name and its fields must be specified when the table is created:
create table user_info (
  id int,            -- column types are illustrative; the notes leave them unspecified
  name varchar(32),
  tel varchar(16)
);
Then insert two rows, each with a statement of the form:
insert into user_info values('...','...','...')
The table structure is as follows
id | name | tel |
1 | Akari | 123 |
2 | little king | 456 |
Later these fields turn out to be insufficient: new users also need an address recorded, so a new field has to be added to the table.
id | name | tel | addr |
1 | Akari | 123 | |
2 | little king | 456 | |
As requirements keep growing, you either continue adding fields or add an extension table.
The key points so far:
- To create a table, you must specify the table name and its fields in advance
- To insert a record, you specify the table name and a value for each field
- A table is a two-dimensional structure of rows and columns
- Adding fields is inflexible, because it means changing the schema of the whole table
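The relational workflow above can be sketched with Python's built-in sqlite3 module. This is only a rough illustration under assumptions: SQLite stands in for whatever relational database you use, and the column types are chosen for the example because the notes do not specify them.

```python
import sqlite3

# In-memory database; column types are assumptions, the notes leave them open.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_info (id INTEGER, name TEXT, tel TEXT)")

# Every insert supplies a value for each declared column.
conn.executemany(
    "INSERT INTO user_info VALUES (?, ?, ?)",
    [(1, "Akari", "123"), (2, "little king", "456")],
)

# A new requirement (addr) means changing the schema of the whole table.
conn.execute("ALTER TABLE user_info ADD COLUMN addr TEXT")

# Existing rows hold NULL (None in Python) for the new column.
rows = conn.execute(
    "SELECT id, name, tel, addr FROM user_info ORDER BY id"
).fetchall()
print(rows)
```

Both existing rows come back with an empty addr, which matches the second table above.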
Now let's look at how HBase handles the same case.
HBase table structure
When creating a table, you specify the table name and one or more column families.
Create-table statement (HBase shell):
create 'user_info', 'base_info', 'ext_info'
This creates a new table named user_info with two column families, base_info and ext_info.
A column family is a collection of columns; one column family can contain many columns.
The table structure at this time:
row key | base_info | ext_info |
... | ... | ... |
The row key is the unique ID of each row. Every table has this field implicitly, so you do not declare it when creating the table; you supply its value with each write.
Insert one user record: name is 'a', tel is '123'.
Insert statements:
put 'user_info', 'row1', 'base_info:name', 'a'
put 'user_info', 'row1', 'base_info:tel', '123'
The first statement writes name:a into the base_info column family of the row with row key row1 in the user_info table; the second writes tel:123 into the same row and column family.
name and tel are the specific fields (columns), and both belong to the base_info column family.
The table structure at this time:
row key | base_info | ext_info |
row1 | name:a, tel:123 | |
Insert another record: name is 'b', addr is 'bj' (short for Beijing).
put 'user_info', 'row2', 'base_info:name', 'b'
put 'user_info', 'row2', 'ext_info:addr', 'bj'
The table structure at this time:
row key | base_info | ext_info |
row1 | name:a, tel:123 | |
row2 | name:b | addr:bj |
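The table above can be read as a nested map: row key, then column family, then field, then value. Here is a minimal in-memory sketch of that idea in Python. The Table class and its put method are hypothetical names made up for this illustration; they are not the HBase client API, and the sketch ignores storage, versions, and distribution.

```python
from collections import defaultdict


class Table:
    """Toy in-memory model of an HBase table (not the real client API)."""

    def __init__(self, name, *families):
        self.name = name
        self.families = set(families)  # fixed when the table is created
        # row key -> column family -> field name -> value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value):
        """Write one cell; column is 'family:field' as in the shell."""
        family, field = column.split(":", 1)
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][field] = value


user_info = Table("user_info", "base_info", "ext_info")
user_info.put("row1", "base_info:name", "a")
user_info.put("row1", "base_info:tel", "123")
user_info.put("row2", "base_info:name", "b")
user_info.put("row2", "ext_info:addr", "bj")

for row_key, families in user_info.rows.items():
    print(row_key, {f: dict(cols) for f, cols in families.items()})
```

Note that row1 simply has no ext_info entry at all. Rows are sparse: a field that was never written takes no space, and field names inside a family were never declared anywhere.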
There is one more important concept in HBase tables: the version. Every value of a field carries version information, identified by a timestamp.
For example, each time base_info:name is modified the previous value can be retained, so the old value can still be retrieved. How many versions are kept depends on the VERSIONS setting of the column family.
row key | base_info | ext_info |
row1 | name:a, tel:123 | |
row2 | name:c(v2)[name:b(v1)] | addr:bj |
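The versioning behaviour shown in the last table can be sketched as follows. This is a toy model under stated assumptions: VersionedTable, put, and get are hypothetical names, a simple counter stands in for the real timestamp, and max_versions=3 is an arbitrary choice for the example rather than an HBase default.

```python
import itertools


class VersionedTable:
    """Toy model: each cell keeps several timestamped versions."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row key, column family, field) -> [(timestamp, value), ...]
        self.cells = {}
        self._clock = itertools.count(1)  # stand-in for a real timestamp

    def put(self, row, family, field, value):
        versions = self.cells.setdefault((row, family, field), [])
        versions.insert(0, (next(self._clock), value))  # newest first
        del versions[self.max_versions:]                # drop the oldest

    def get(self, row, family, field, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        return self.cells.get((row, family, field), [])[:versions]


t = VersionedTable(max_versions=3)
t.put("row2", "base_info", "name", "b")  # v1
t.put("row2", "base_info", "name", "c")  # v2

latest = t.get("row2", "base_info", "name")
history = t.get("row2", "base_info", "name", versions=2)
print(latest, history)
```

A plain read returns only the newest value (c), while asking for two versions also returns the older one (b). The lookup key also makes the coordinates explicit: row key, column family, field name, and then the version.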
Summary
From the process of creating a table and inserting data above, we can see how HBase stores data:
- Like a relational database, it presents data as rows and columns
- When creating a table, you define the table name and the column families (collections of fields), not the specific fields
- A column family can contain any number of fields, the field names do not need to be predefined, and different rows may have different fields within the same column family
- The structure is multidimensional: a relational table is two-dimensional, and a value is located by its row and column, whereas HBase locates a value by row key, column family name, field name, and version
- In the HBase shell, each put writes the value of one field, instead of supplying all fields of a row in one statement as a relational insert does (the Java client can group several fields of the same row into a single Put)