Essentials | An introduction to HBase, its data structure, and its tables

Note: This article is an excerpt from the new book "Distributed Machine Learning in Action" (Artificial Intelligence Science and Technology Series), edited by Chen Jinglei and published by Tsinghua University Press. Mr. Chen Jinglei is the founder, CEO and CTO of Charging.

Preface

HBase is often used to store real-time data. For example, user behavior logs consumed by Storm, Flink or Spark Streaming are processed and then written to HBase, where they can be queried in milliseconds through the HBase API. For non-real-time, offline statistics on HBase data, we can create a Hive table mapped to HBase and then write Hive SQL to analyze the data. This also makes it convenient to join HBase data with other Hive tables for more complex statistics. HBase therefore serves both real-time and offline scenarios, and it is very common in Internet companies.

HBase principles and features

HBase is a distributed, column-oriented, open source database. The technology derives from the Google paper "Bigtable: A Distributed Storage System for Structured Data" by Fay Chang et al. Just as Bigtable builds on the distributed storage provided by the Google File System (GFS), HBase provides Bigtable-like capabilities on top of Hadoop. HBase began as a sub-project of Apache Hadoop and is now a top-level Apache project. Unlike a typical relational database, HBase is suited to storing unstructured and semi-structured data. Another difference is that HBase stores data by column (family) rather than by row.
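
To make the row-based versus column-based distinction concrete, the sketch below lays out the same two records both ways in plain Python. It only illustrates the idea and is not HBase's actual on-disk format; the record contents and the names `rows` and `group_by_family` are hypothetical.

```python
# Two hypothetical records as a row-oriented store would hold them:
# every row carries all of its fields together.
rows = [
    {"rowkey": "r1", "info:name": "alice", "stats:visits": 3},
    {"rowkey": "r2", "info:name": "bob", "stats:visits": 7},
]


def group_by_family(records):
    """Regroup cells so that each column family is kept together."""
    by_family = {}
    for record in records:
        for column, value in record.items():
            if column == "rowkey":
                continue
            family = column.split(":", 1)[0]
            # Each family keeps its own list of (rowkey, column, value) cells.
            by_family.setdefault(family, []).append(
                (record["rowkey"], column, value)
            )
    return by_family


by_family = group_by_family(rows)
# Reading one family touches only that family's cells.
print(by_family["stats"])
```

Grouping by family is what lets a query that needs only `stats` skip the `info` data entirely.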

1. HBase features

1) HBase is built on HDFS
HBase is a distributed column-oriented storage system built on HDFS, and its data can also be queried through Hive.
2) HBase is a key/value system
HBase is modeled on Google's Bigtable, a typical key/value system.
3) HBase is used for massive structured data storage
HBase is an important member of the Apache Hadoop ecosystem and is mainly used to store massive amounts of structured data.
4) Distributed storage
HBase stores data in tables, rows and columns. Like Hadoop, HBase scales horizontally: computing and storage capacity grow by adding inexpensive commodity servers.
5) Large tables
An HBase table can be very large: a single table can have billions of rows and millions of columns.
6) No schema
Each row has a sortable primary key and any number of columns. Columns can be added dynamically as needed, and different rows in the same table can have completely different columns, which a relational database such as MySQL does not allow.
7) Column-oriented and sparse
Storage and permission control are organized by column (family), and column families can be retrieved independently. Empty (null) columns occupy no storage space, so tables can be designed to be very sparse.
8) Multiple versions of data
Each cell can hold multiple versions of its data. By default the version number is the timestamp at which the cell was inserted. Older HBase releases keep 3 versions by default; since HBase 0.96 the default is 1.
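
The schema-less, sparse and multi-version features can be sketched with a toy in-memory table. This is a simplified model for illustration, not the HBase client API: the class `SparseTable`, its methods and the sample data are all hypothetical, and the limit of 3 versions is passed explicitly.

```python
import time


class SparseTable:
    """Toy model: row key -> column -> list of (timestamp, value)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.rows = {}

    def put(self, row_key, column, value, timestamp=None):
        # Default the version to the insertion time in milliseconds.
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        versions = self.rows.setdefault(row_key, {}).setdefault(column, [])
        versions.append((timestamp, value))
        # Keep the newest version first and drop versions beyond the limit.
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row_key, column):
        """Return the newest value, or None if the cell was never written."""
        versions = self.rows.get(row_key, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self):
        """Yield (row key, columns) pairs in sorted row-key order."""
        for row_key in sorted(self.rows):
            yield row_key, self.rows[row_key]


table = SparseTable(max_versions=3)
# Different rows may carry completely different columns.
table.put("row2", "info:email", "b@example.com", timestamp=1)
table.put("row1", "info:name", "alice", timestamp=1)
# Four writes to one cell: only the newest three versions are kept.
for ts in (2, 3, 4):
    table.put("row1", "info:name", f"alice-v{ts}", timestamp=ts)
```

A column that was never written, such as `info:email` for `row1`, simply has no entry, which is why a sparse table costs nothing for its empty cells.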

2. The core components of the HBase architecture

The core components of the HBase architecture are the Client, HMaster, HRegionServer, and a ZooKeeper cluster that acts as the coordination system. The most important are HMaster, the master node of HBase, and HRegionServer, the slave node. HBase depends on the ZooKeeper cluster.
1) Client
The Client provides the interface for accessing HBase and maintains a cache, for example of Region location information, to speed up access.
2) HMaster
(1) Manages HRegionServers and balances load across them;
(2) Manages and assigns HRegions, for example assigning the new HRegions when an HRegion splits, and migrating the HRegions of an HRegionServer to other HRegionServers when it exits;
(3) Performs DDL operations (Data Definition Language: creating, deleting and modifying namespaces, tables and column families);
(4) Manages namespace and table metadata (actually stored on HDFS);
(5) Access control (ACL).
3) HRegionServer
(1) Stores and manages its local HRegions;
(2) Reads and writes HDFS and manages the data in tables;
(3) Serves client reads and writes directly: the Client first looks up metadata to find the HRegion/HRegionServer that holds the RowKey, then reads and writes data through that HRegionServer.
4) ZooKeeper cluster (the coordination system)
(1) Stores the metadata and cluster status information of the entire HBase cluster;
(2) Handles failover between the active and standby HMaster nodes.

The HBase Client communicates with HMaster and HRegionServer through RPC. One HRegionServer can host about 1,000 HRegions. The underlying table data is stored in HDFS, and each HRegion is served as close as possible to the DataNode that holds its data, which achieves data locality.
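
The client-side caching of Region locations described above can be sketched as follows. This is a simplified model, not the real HBase client: the `META` list, the server names `rs1` to `rs3`, and the `Client` class are hypothetical, and it assumes each Region is identified by its start key, with Regions covering contiguous, sorted row-key ranges.

```python
import bisect

# Hypothetical metadata: (Region start key, server hosting that Region).
# The first start key is "" so that every row key falls into some Region.
META = [("", "rs1"), ("g", "rs2"), ("p", "rs3")]


class Client:
    def __init__(self, meta):
        self._meta = meta
        self._cached = None      # Region list, cached after the first lookup
        self.meta_lookups = 0    # counts simulated metadata round trips

    def _regions(self):
        if self._cached is None:
            self.meta_lookups += 1
            self._cached = sorted(self._meta)
        return self._cached

    def locate(self, row_key):
        """Return the server whose Region contains row_key."""
        regions = self._regions()
        start_keys = [start for start, _ in regions]
        # The owning Region is the last one whose start key is <= row_key.
        index = bisect.bisect_right(start_keys, row_key) - 1
        return regions[index][1]


client = Client(META)
servers = [client.locate(key) for key in ("apple", "grape", "zebra")]
```

After the first call the metadata is served from the cache, so repeated reads and writes go straight to the right server without another metadata lookup.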

HBase data structure and table details

An HBase table is composed of row keys and column families. The row key can be thought of as the primary key of the table. A column family can contain multiple columns, and columns can be added dynamically. This is an advantage of HBase as a column-oriented database and a difference from a relational database such as MySQL, where columns are fixed by the table schema and adding one requires changing that schema. HBase is very flexible here: a column can be created on the fly according to business needs. Let's look at the structure of a table:
1. Row Key
The row key is the primary key used to retrieve records and access rows in an HBase table.
2. Column Family
In the horizontal direction, a table consists of one or more column families. A column family can contain any number of columns; that is, it supports dynamic expansion without predefining the number or type of columns. All columns are stored in binary format, so users must perform type conversion themselves.
3. Column
In HBase, a column is identified by its column family plus a column name (qualifier).
4. Cell
HBase locates a storage unit by row and column; this storage unit is called a cell.
5. Version
Each cell can store multiple versions of the same data. Versions are indexed by timestamp; older HBase releases keep three versions by default (one since HBase 0.96).
6. The following is an example of an HBase table structure, as shown in Table 3.1:

Table 3.1 HBase table structure description

Explanation: In this example the table holds one row whose row key is kc61800001. It has two column families: name, which has a single column kcname, and kcsaleinfo, which has two columns, price and issale.
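
The row described above can be sketched in Python as a map from cell coordinates to binary values, which also shows the type conversion that falls to the user. The row key, column families and column names come from the example; the values, the big-endian encodings and the helper names `read_price` and `columns_of` are assumptions made for illustration.

```python
import struct

# Cell coordinates: (row key, column family, column name) -> bytes value.
# The values below are made up; only the coordinates follow the example.
row_key = b"kc61800001"
cells = {
    (row_key, b"name", b"kcname"): "Sample course".encode("utf-8"),
    (row_key, b"kcsaleinfo", b"price"): struct.pack(">d", 99.0),
    (row_key, b"kcsaleinfo", b"issale"): struct.pack(">?", True),
}


def read_price(store, key):
    """Decode the 8-byte big-endian double stored in kcsaleinfo:price."""
    (price,) = struct.unpack(">d", store[(key, b"kcsaleinfo", b"price")])
    return price


def columns_of(store, key, family):
    """List the column names present in one family of one row."""
    return sorted(q for (k, f, q) in store if k == key and f == family)


price = read_price(cells, row_key)
sale_columns = columns_of(cells, row_key, b"kcsaleinfo")
```

Because every value is just bytes, the reader and the writer must agree on the encoding of each column; here the price is assumed to be a double and issale a single-byte boolean.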

Summary

A supporting video accompanies this article. For more articles, download the Charging app, which offers thousands of free lessons and articles. For the supporting textbook, see Chen Jinglei's new book "Distributed Machine Learning in Action" (Artificial Intelligence Science and Technology Series).
