HBase (1) - HBase Introduction

HBase Introduction

1, Relational and non-relational databases

(1) Relational database


The most typical data structure in a relational database is the table: data is organized as two-dimensional tables together with the relationships between them.

Advantages:

1, easy to maintain: all data uses the table structure, so the format is consistent

2, easy to use: SQL is a common language and can be used for complex queries

3, supports complex operations: SQL can express very complex queries within one table and across multiple tables

Disadvantages:

1, read and write performance is relatively poor, especially for high-throughput reads and writes of massive data

2, the table structure is fixed, so flexibility is lower

3, under highly concurrent reads and writes, disk I/O is a major bottleneck for a traditional relational database

(2) Non-relational database


Strictly speaking, a non-relational database is not a database in the traditional sense. It is better described as a collection of methods for storing structured data, which can be documents or key-value pairs.

Advantages:

1, flexible format: stored data can take the form of key-value pairs, documents, images, and so on, which suits a wide range of scenarios, whereas a relational database supports only basic types.

2, speed: NoSQL can use memory or disk as the storage medium, whereas a traditional relational database relies on disk

3, high scalability

4, low cost: NoSQL databases are simple to deploy and are mostly open-source software

Disadvantages:

1, no SQL support, so the cost of learning and using it is higher;

2, generally no transaction support

3, data structures are relatively complex, and complex queries are weaker

2, HBase overview

    Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

HBase stands for Hadoop Database. It is a highly reliable, high-performance, column-oriented, scalable, distributed database for real-time reads and writes.

It uses Hadoop HDFS as its file storage system, Hadoop MapReduce to process the massive data held in HBase, and Zookeeper as its distributed coordination service.

It is mainly used to store loosely structured data, that is, unstructured and semi-structured data (it is a column-oriented NoSQL database).

Note: NoSQL stands for Not Only SQL and refers to non-relational databases.

3, HBase data model


(1) RowKey

(1) uniquely identifies a row of data;

(2) rows are sorted lexicographically by RowKey;

(3) a RowKey can store at most 64 KB of bytes.
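Lexicographic ordering has a consequence that is easy to miss: keys are compared byte by byte, so numeric suffixes do not sort numerically unless they are zero-padded. A minimal Python sketch using plain byte strings (no HBase client involved; the key names are made up for illustration):

```python
# Row keys are compared as raw bytes, the same way Python compares bytes objects.
keys = [b"row2", b"row10", b"row1"]
sorted_keys = sorted(keys)
print(sorted_keys)  # [b'row1', b'row10', b'row2']

# Zero-padding the numeric part makes byte order match numeric order.
padded_keys = sorted(b"row%04d" % n for n in (2, 10, 1))
print(padded_keys)  # [b'row0001', b'row0002', b'row0010']
```

This is why row key design usually pads or otherwise fixes the width of numeric components when range scans should follow numeric order.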

(2) Column Family & Qualifier

(1) every column in an HBase table belongs to a column family. Column families are part of the table schema and must be defined in advance, for example: create 'test', 'course';

(2) column names are prefixed with the column family, and each column family can have multiple column members, for example course:math and course:english; new column members can be added dynamically later as needed;

(3) access control, storage, and tuning are all done at the column family level;

(4) data in the same column family is stored in the same directory in HBase, saved as several files.
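The split between a fixed set of column families and free-form column members can be sketched with a toy in-memory table. This is an illustration only, not the HBase API: the ToyTable class and its put method are hypothetical names.

```python
class ToyTable:
    """Toy model: column families are fixed at creation, qualifiers are free-form."""

    def __init__(self, name, *families):
        self.name = name
        self.families = set(families)
        self.rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row, column, value):
        family, _, qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row, {})[f"{family}:{qualifier}"] = value


table = ToyTable("test", "course")          # mirrors: create 'test', 'course'
table.put("row1", "course:math", b"90")     # qualifier appears on first write
table.put("row1", "course:english", b"85")  # another qualifier, no schema change

try:
    table.put("row1", "grade:year", b"3")   # family 'grade' was never declared
    rejected = False
except KeyError:
    rejected = True
```

Writing to course:english needs no schema change, while writing to the undeclared family grade is rejected.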

(3) Timestamp

(1) each cell in HBase can hold multiple versions of the same data. Versions are distinguished by timestamp and sorted in reverse chronological order, so the most recent version comes first.

(2) the timestamp type is a 64-bit integer.

(3) the timestamp can be assigned by HBase automatically when data is written, in which case it is the current system time accurate to the millisecond.

(4) the timestamp can also be assigned explicitly by the client. An application that needs to avoid version conflicts must generate unique timestamps itself.
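The versioning rules above can be modeled in a few lines. The sketch below is a toy, not HBase code: VersionedCell is a hypothetical class, and the max_versions limit and the "same timestamp replaces the earlier value" behavior are assumptions of the sketch that show why colliding timestamps are a problem.

```python
import time


class VersionedCell:
    """Toy cell that keeps several timestamped versions, newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions  # assumed limit for this sketch
        self.versions = []                # list of (timestamp_ms, value)

    def put(self, value, timestamp_ms=None):
        if timestamp_ms is None:
            # Default: current system time in milliseconds.
            timestamp_ms = int(time.time() * 1000)
        # A write with an existing timestamp replaces that version.
        self.versions = [(ts, v) for ts, v in self.versions if ts != timestamp_ms]
        self.versions.append((timestamp_ms, value))
        self.versions.sort(key=lambda tv: tv[0], reverse=True)
        del self.versions[self.max_versions:]

    def get(self, timestamp_ms=None):
        """Return the newest value, or the newest one at or before timestamp_ms."""
        for ts, value in self.versions:
            if timestamp_ms is None or ts <= timestamp_ms:
                return value
        return None


cell = VersionedCell()
cell.put(b"v1", timestamp_ms=100)
cell.put(b"v2", timestamp_ms=200)
cell.put(b"v2b", timestamp_ms=200)  # same timestamp: the earlier value is lost
```

After these three writes the cell holds two versions, not three, which is the version conflict that unique client-generated timestamps avoid.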

(4) Cell

(1) a cell is determined by the intersection of row and column coordinates;

(2) cells are versioned;

(3) the content of a cell is an uninterpreted byte array;

1, a cell is the unit uniquely determined by {row key, column (= column family + qualifier), version}.
2, cell data has no type; everything is stored as a byte array.
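Because cells hold only bytes, encoding and decoding is the caller's job. A minimal sketch with a plain dictionary keyed by the cell coordinates (the cells dictionary and put helper are hypothetical, and the 8-byte big-endian integer encoding is one possible choice among many):

```python
import struct

cells = {}  # (row key, column, version) -> raw bytes


def put(row: bytes, column: bytes, version: int, value: bytes) -> None:
    cells[(row, column, version)] = value


# The store never sees types; the writer picks an encoding.
put(b"row1", b"course:math", 1, struct.pack(">q", 90))      # 8-byte big-endian integer
put(b"row1", b"course:name", 1, "algebra".encode("utf-8"))  # UTF-8 text

# The reader must know which encoding was used for which column.
raw = cells[(b"row1", b"course:math", 1)]
score = struct.unpack(">q", raw)[0]
name = cells[(b"row1", b"course:name", 1)].decode("utf-8")
```

Reading course:math with the wrong decoder would not fail at the storage layer; it would simply produce a wrong value, so the encoding convention has to live in the application.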

4, HBase architecture

[Figure: HBase architecture]

Roles:

(1) Client

1, provides the interfaces for accessing HBase and maintains a cache to speed up access to HBase.

(2) Zookeeper

1, ensures that at any time there is only one active master in the cluster

2, stores the addressing entry point for all regions

3, monitors region servers going online and offline in real time and notifies the master

4, stores the HBase schema and table metadata

(3) Master

1, assigns regions to region servers

2, is responsible for load balancing across region servers

3, detects failed region servers and reassigns the regions they hosted

4, handles user operations that create, delete, and modify tables

(4) RegionServer

1, the region server maintains regions and handles I/O requests for those regions

2, the region server is responsible for splitting regions that grow too large during operation

RegionServer in detail

(1) Region

1, HBase automatically splits a table horizontally into multiple regions; each region stores a contiguous range of the table's data

2, each table starts with only one region. As data is continuously inserted, the region grows, and when it reaches a threshold it splits into two new regions

3, as the number of rows in the table keeps growing, there are more and more regions, so a complete table ends up stored across multiple RegionServers.
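The growth-and-split behavior described above can be sketched as follows. This is a toy model: ToyRegionedTable is a hypothetical class, and the threshold counts rows purely to keep the example small, whereas the text describes a size threshold.

```python
import bisect

MAX_ROWS_PER_REGION = 4  # hypothetical threshold for this sketch


class ToyRegionedTable:
    def __init__(self):
        # Each region is a sorted list of row keys; a table starts with one region.
        self.regions = [[]]

    def _region_index(self, row):
        # A region's start key is its first row; the first region starts at "".
        starts = [""] + [region[0] for region in self.regions[1:]]
        return bisect.bisect_right(starts, row) - 1

    def put(self, row):
        i = self._region_index(row)
        region = self.regions[i]
        if row not in region:
            bisect.insort(region, row)
        if len(region) > MAX_ROWS_PER_REGION:
            mid = len(region) // 2
            # Replace the oversized region with two contiguous halves.
            self.regions[i:i + 1] = [region[:mid], region[mid:]]


t = ToyRegionedTable()
for n in range(10):
    t.put(f"row{n:02d}")
print(len(t.regions))  # 4
```

Each region still covers a contiguous, sorted key range after every split, which is what lets a lookup find the right region from the start keys alone.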

(2) MemStore and StoreFile

1, a region consists of multiple stores, and each store corresponds to one CF (column family)

2, a store consists of a MemStore in memory and StoreFiles on disk. Writes go to the MemStore first; when the data in the MemStore reaches a threshold, the HRegionServer starts a flush that writes it to a StoreFile, and each flush produces a separate StoreFile

3, when the number of StoreFiles grows to a threshold, the system compacts them (minor or major compaction). During compaction, versions are merged and deletes are applied (major compaction), producing a larger StoreFile

4, when the total size of the StoreFiles in a region exceeds a threshold, the region is split into two, and the HMaster assigns the new regions to appropriate RegionServers for load balancing

5, when a client retrieves data, it looks in the MemStore first, then in the BlockCache, and finally in the StoreFiles
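The flush and compaction cycle above can be sketched with a toy store. Everything here is illustrative: ToyStore is a hypothetical class, the thresholds count entries and files instead of bytes, the BlockCache is left out, and the only compaction modeled is one that merges all files and drops deleted keys, which corresponds to what the text calls a major compaction.

```python
FLUSH_THRESHOLD = 3    # hypothetical: MemStore entries before a flush
COMPACT_THRESHOLD = 3  # hypothetical: StoreFiles before a compaction
TOMBSTONE = None       # marker written by a delete


class ToyStore:
    def __init__(self):
        self.memstore = {}
        self.storefiles = []  # oldest first; each one is a dict snapshot

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= FLUSH_THRESHOLD:
            self.flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)

    def flush(self):
        if self.memstore:
            self.storefiles.append(dict(self.memstore))  # one new file per flush
            self.memstore = {}
        if len(self.storefiles) >= COMPACT_THRESHOLD:
            self.major_compact()

    def major_compact(self):
        merged = {}
        for storefile in self.storefiles:  # newer files overwrite older ones
            merged.update(storefile)
        # Deleted keys are physically removed at this point.
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.storefiles = [merged]

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for storefile in reversed(self.storefiles):  # newest file first
            if key in storefile:
                return storefile[key]
        return None


store = ToyStore()
for i in range(6):
    store.put(f"k{i}", i)  # two flushes, two StoreFiles
store.delete("k0")
store.put("k6", 6)
store.put("k7", 7)         # third flush triggers the compaction
```

After the last write the three small files have been merged into one, and k0 is gone from it rather than merely hidden by a delete marker.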

Points to note:

1, the HRegion is the smallest unit of distributed storage and load balancing in HBase. Being the smallest unit means that different HRegions can be placed on different HRegion servers.

2, an HRegion consists of one or more Stores, and each Store holds one column family.

3, each Store in turn consists of one MemStore and zero or more StoreFiles. StoreFiles are stored on HDFS in HFile format, as shown in the figure.

[Figure: HRegion, Store, MemStore, and StoreFile layout]

5, HBase read and write processes

(1) Read process

1, the client asks Zookeeper for the RegionServer node that hosts the meta table

2, the client accesses the RegionServer that hosts the meta table and obtains the RegionServer where the target region is located

3, the client accesses the RegionServer where the target region is located and finds the corresponding region and store

4, it first reads from the MemStore; if the data is found, it is returned directly, and if not, it reads from the BlockCache

5, if the data is found in the BlockCache, it is returned to the client directly; if not, the StoreFiles are traversed to find the data

6, if the data is not found in the StoreFiles either, an empty result is returned to the client; if it is found, the data is first cached in the BlockCache (to speed up the next read) and then returned to the client.

7, the BlockCache is in-memory space. When it fills up with cached data, an LRU policy is applied and older data is evicted.
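Steps 4 to 7 of the read process can be sketched as a lookup function with a small LRU cache. This is a simplified model of the description above, not HBase code: LRUBlockCache and read are hypothetical names, and the cache here stores single values per key, which is an assumption made to keep the sketch short.

```python
from collections import OrderedDict


class LRUBlockCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry


def read(key, memstore, blockcache, storefiles):
    """Follow the lookup order from the steps above; return (value, source)."""
    if key in memstore:
        return memstore[key], "memstore"
    cached = blockcache.get(key)
    if cached is not None:
        return cached, "blockcache"
    for storefile in reversed(storefiles):  # newest file first
        if key in storefile:
            blockcache.put(key, storefile[key])  # cache for the next read
            return storefile[key], "storefile"
    return None, "miss"


memstore = {"a": 1}
storefiles = [{"b": 2, "c": 3, "d": 4}]
cache = LRUBlockCache(capacity=2)

results = [
    read("a", memstore, cache, storefiles),  # found in the MemStore
    read("b", memstore, cache, storefiles),  # found in a StoreFile, then cached
    read("b", memstore, cache, storefiles),  # served from the BlockCache
    read("c", memstore, cache, storefiles),
    read("d", memstore, cache, storefiles),  # cache is full, "b" is evicted
    read("b", memstore, cache, storefiles),  # back to the StoreFile
    read("z", memstore, cache, storefiles),  # not stored anywhere
]
```

The second and sixth reads of b show both sides of the cache: a hit right after the first read, and a trip back to the StoreFile once newer entries have pushed it out.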

(2) Write process

1, the client asks Zookeeper for the RegionServer node that hosts the meta table

2, the client accesses the RegionServer that hosts the meta table and obtains the RegionServer where the target region is located

3, the client accesses the RegionServer where the target region is located and finds the corresponding region and store

4, the write begins. Data is first written to the HLog, so that MemStore data lost in a failure can be recovered from the HLog. The HLog write itself also goes to memory first, and a background thread asynchronously flushes it to HDFS at intervals; if that HLog write fails, data loss can still occur

5, after the HLog write completes, the data is written to the MemStore. The MemStore size is 64 MB by default in older releases (newer releases default to 128 MB, configurable via hbase.hregion.memstore.flush.size). When the MemStore is full, it is flushed as a whole and its data is persisted to HDFS

6, frequent flushes produce many small files, so files are compacted. There are two kinds of compaction, minor and major: a minor compaction merges small files, while a major compaction merges all StoreFiles into one. The detailed process will be explained in a follow-up post.
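The log-first ordering in steps 4 and 5 is what makes recovery possible, and a toy model shows why. ToyRegionServer is a hypothetical class; the sketch assumes the log and the flushed files are durable, counts entries instead of bytes, and clears the log on flush as a simplification.

```python
class ToyRegionServer:
    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold  # hypothetical, counted in entries
        self.hlog = []        # append-only log, assumed durable in this sketch
        self.memstore = {}    # in memory only, lost on a crash
        self.storefiles = []  # flushed files, assumed durable on HDFS

    def put(self, key, value):
        self.hlog.append((key, value))  # 1. write the log entry first
        self.memstore[key] = value      # 2. then update the MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.storefiles.append(dict(self.memstore))
        self.memstore = {}
        self.hlog = []  # flushed edits no longer need to be replayed

    def crash(self):
        self.memstore = {}  # everything held only in memory is gone

    def recover(self):
        for key, value in self.hlog:  # replay edits that were never flushed
            self.memstore[key] = value


rs = ToyRegionServer()
rs.put("a", 1)
rs.put("b", 2)  # reaches the threshold and is flushed to a StoreFile
rs.put("c", 3)  # lives only in the log and the MemStore
rs.crash()
lost_before_recovery = "c" not in rs.memstore
rs.recover()
```

After recover, c is back in the MemStore because its log entry survived, while a and b were never at risk once they had been flushed.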


Origin www.cnblogs.com/littlepage/p/11293824.html