Datawhale team study - Juicy Big Data, Day 3

An introductory course from Datawhale on big data technology: Juicy Big Data


4. HBase

1. Background

  1. Limitations of Hadoop: it accesses data in a batch, sequential manner and cannot provide random access to data

  2. Classification of data structures: structured data, semi-structured data, unstructured data

  3. Databases for storing these different data structures include:

    relational databases (MySQL), key-value stores (Redis), column-oriented databases (HBase), document-oriented databases (MongoDB), graph databases (Neo4j), and search-engine databases (Solr)

  4. The main differences between HBase and traditional relational databases lie in: data types (values are stored as uninterpreted strings), data operations (data is not fully normalized), storage mode (column-based storage), data indexes (only the row key is indexed), data maintenance (old versions are retained for a period of time rather than overwritten), and scalability (good horizontal scalability)

2. Overview of HBase

  1. A highly reliable, high-performance, column-oriented, scalable distributed database built on top of the Hadoop file system (HDFS), mainly used to store unstructured and semi-structured loose data.
  2. Provides fast random access to large volumes of structured data.


3. HBase data model

Related concepts

  • Table : HBase uses tables to organize data. Tables are composed of rows and columns, and columns are divided into several column families.
  • Row : Each HBase table consists of several rows, and each row is identified by a row key.
  • Column family : The columns of an HBase table are grouped into several "column families", which are the basic units of access control. Every column in the table belongs to some column family, and data can only be stored under a column of a family that has already been created. Once a column family exists, any column within it can be used; column names are prefixed with the column family, e.g. courses:history and courses:math both belong to the courses column family.
  • Column qualifiers : Data in a column family is located by column qualifiers (or columns).
  • Cell : In an HBase table, a "cell" is determined by the row key, column family, and column qualifier. The data stored in a cell has no data type and is always treated as a byte array byte[] (see the Java sketch after this list).
  • Timestamp : Each cell holds multiple versions of the same data, and these versions are indexed by timestamp.
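
To make these terms concrete, here is a minimal Java sketch using the standard HBase 2.x client API; the table name student is hypothetical, and the courses:history / courses:math columns follow the example above:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName name = TableName.valueOf("student");  // hypothetical table name
            if (!admin.tableExists(name)) {
                // A column family must be created (together with the table) before its columns can be used
                admin.createTable(TableDescriptorBuilder.newBuilder(name)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("courses"))
                        .build());
            }
            try (Table table = conn.getTable(name)) {
                // Row key "s001"; courses:history and courses:math share the "courses" column family
                Put put = new Put(Bytes.toBytes("s001"));
                put.addColumn(Bytes.toBytes("courses"), Bytes.toBytes("history"), Bytes.toBytes("89"));
                put.addColumn(Bytes.toBytes("courses"), Bytes.toBytes("math"), Bytes.toBytes("77"));
                table.put(put);

                // A cell is addressed by row key + column family + column qualifier; the value is raw bytes
                Result r = table.get(new Get(Bytes.toBytes("s001")));
                byte[] value = r.getValue(Bytes.toBytes("courses"), Bytes.toBytes("history"));
                System.out.println(Bytes.toString(value));  // prints "89"
            }
        }
    }
}
```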

data coordinates

  1. A " four-dimensional coordinate ", namely[行键, 列族, 列限定符, 时间戳]

  2. HBase can therefore be regarded as a key-value database: the "four-dimensional coordinate" is the key, and the cell content is the value (see the sketch below)
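
A small sketch of this key-value view, reusing the hypothetical student table from the previous sketch: each [row key, column family, column qualifier, timestamp] coordinate maps to exactly one stored value. It assumes the column family has been configured to retain more than one version.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CellCoordinates {
    // Prints every stored version of one cell: the key is the four-dimensional
    // coordinate [row key, column family, column qualifier, timestamp],
    // and the value is the cell content
    static void printVersions(Connection conn) throws IOException {
        try (Table table = conn.getTable(TableName.valueOf("student"))) {  // hypothetical table
            Get get = new Get(Bytes.toBytes("s001"));
            get.addColumn(Bytes.toBytes("courses"), Bytes.toBytes("math"));
            get.readVersions(3);  // ask for up to 3 versions of the cell
            Result result = table.get(get);
            for (Cell cell : result.getColumnCells(Bytes.toBytes("courses"), Bytes.toBytes("math"))) {
                System.out.println("[s001, courses, math, " + cell.getTimestamp() + "] -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```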


conceptual view

A table can be viewed as a sparse, multidimensional mapping

  • e.g. a fragment of an HBase table that stores web pages: every row has the same set of column families, but a row does not have to store data under every column family (which is why the table is sparse)


physical view

Physically, HBase uses column-based storage (the biggest difference from traditional relational databases)

  • e.g. when the aforementioned conceptual view is stored physically, it is saved as two small fragments, one per column family


column-oriented storage

  • Data is stored by column, and each column is stored separately
  • The data itself serves as the index
  • Only the columns involved in a query are accessed, greatly reducing system I/O (see the Scan sketch after this list)
  • Each column can be handled by its own thread, so a query is processed concurrently
  • Within a column the data type is uniform and the values are similar, so highly efficient compression can be applied
  • Disadvantage: join operations incur an expensive tuple-reconstruction cost
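
A brief sketch of the "read only what you need" point using the HBase client: a Scan restricted to a single column family never touches data stored under the other families. The student table and courses family are the hypothetical names used earlier.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FamilyScan {
    // Scans only the "courses" column family; data stored under other column families is never read
    static void scanCoursesOnly(Connection conn) throws IOException {
        try (Table table = conn.getTable(TableName.valueOf("student"))) {  // hypothetical table
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("courses"));  // restrict the scan to a single column family
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()) + " -> " + row);
                }
            }
        }
    }
}
```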

4. HBase implementation principle

HBase functional components

  • Library functions
    • Linked into every client
  • A Master main server
    • Responsible for managing and maintaining the partition information of the HBase table, maintaining the list of Region servers, assigning Regions, and load balancing
  • Many Region Servers
    • Responsible for storing and maintaining the Region assigned to it, and processing read and write requests from clients
    • When a client reads data, it first obtains the Region's location information and then reads the data directly from that Region server
    • The client obtains Region location information through Zookeeper, so most clients never even communicate with the Master (a minimal connection sketch follows this list)
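
A minimal connection sketch, assuming hypothetical ZooKeeper hosts zk1,zk2,zk3: the client is configured only with the ZooKeeper quorum, not with the Master's address.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ClientConnection {
    // The client only needs the ZooKeeper quorum to locate Regions;
    // reads and writes then go directly to the Region servers, not through the Master
    static Connection connect() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");        // hypothetical ZooKeeper hosts
        conf.set("hbase.zookeeper.property.clientPort", "2181");  // default ZooKeeper client port
        return ConnectionFactory.createConnection(conf);
    }
}
```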

Tables and Regions

  • HBase stores many tables, and each HBase table contains a very large number of rows (far too many to be stored on a single machine)

  • An HBase table is divided into multiple Region partitions

    • A Region contains all the rows within a certain row-key range and is the basic unit of load balancing and data distribution
  • As a Region grows, it splits into new Regions (a pre-split table sketch follows this list)
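
A sketch of creating a table that is divided into Regions from the start, assuming the HBase 2.x Admin API; the table name webtable and the split keys are hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    // Creates a table that is divided into four Regions from the start,
    // each covering one row-key range: (-inf,"g"), ["g","n"), ["n","t"), ["t",+inf)
    static void create(Admin admin) throws IOException {
        byte[][] splitKeys = {Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")};
        TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("webtable"))  // hypothetical table name
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("contents"))
                .build();
        admin.createTable(desc, splitKeys);  // Regions keep splitting on their own as they grow
    }
}
```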

Region positioning

  • Each Region has a RegionID that uniquely identifies it, so a Region identifier can be expressed as table name + start row key + RegionID
  • The "metadata table", also known as the .META. table, records the mapping from each Region identifier to its Region server identifier
  • The .META. table may itself be split into multiple Regions
  • The "root data table", or -ROOT- table, records the locations of all the .META. Regions
    • The -ROOT- table cannot be split, so there is always exactly one Region storing it
    • The name of this single -ROOT- Region is hard-coded in the program, and the Master always knows its location (a region-lookup sketch follows this list)
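
Application code normally never reads the metadata tables directly; the lookup is hidden behind the client API. A sketch, again using the hypothetical webtable:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionLookup {
    // Asks the client which Region (and which Region server) holds a given row key;
    // the metadata lookup described above happens behind this call
    static void locate(Connection conn) throws IOException {
        try (RegionLocator locator = conn.getRegionLocator(TableName.valueOf("webtable"))) {  // hypothetical table
            HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("com.example.www"));
            System.out.println("Region: " + loc.getRegion().getRegionNameAsString()
                    + "  server: " + loc.getHostname());

            // All Regions of the table and the servers they currently live on
            for (HRegionLocation l : locator.getAllRegionLocations()) {
                System.out.println(l.getRegion().getRegionNameAsString() + " -> " + l.getHostname());
            }
        }
    }
}
```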

5. HBase operating mechanism

system structure

  • client

    • The client contains the interface to access HBase
    • The location information of Regions that have already been accessed is kept in a cache to speed up subsequent data access
  • Zookeeper server :

    • Help elect a Master as the manager of the cluster, and ensure that there is always only one Master running at any time, which avoids the "single point of failure" problem of the Master
    • good cluster management tool
  • Master server : The master server is mainly responsible for the management of tables and Regions:

    • Handles users' table operations such as create, delete, modify, and query (an Admin sketch follows this list)
    • Realize load balancing between different Region servers
    • Responsible for readjusting the distribution of the Region after the Region splits or merges
    • Migrate the Region on the failed Region server
  • Region server

    • The core module of HBase, responsible for maintaining the Regions assigned to it and responding to users' read and write requests
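
A sketch of the table-management side (create/delete/query of tables goes through the Master); the table name passed in is hypothetical, and only standard Admin calls are used.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;

public class TableAdmin {
    // Table-level administration (list, disable, delete) is handled through the Master
    static void dropIfExists(Admin admin, String name) throws IOException {
        for (TableName t : admin.listTableNames()) {
            System.out.println("existing table: " + t.getNameAsString());
        }
        TableName table = TableName.valueOf(name);  // e.g. the hypothetical "webtable"
        if (admin.tableExists(table)) {
            admin.disableTable(table);  // a table must be disabled before it can be deleted
            admin.deleteTable(table);
        }
    }
}
```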

How the Region server works

  1. User read and write process
    • When a user writes data, the write is routed to the corresponding Region server for execution
    • User data is first written to the MemStore and the HLog
    • Only after the operation has been written to the HLog does the commit() call return to the client
    • When the user reads data, the Region server first checks the MemStore cache; if the data is not found there, it searches the StoreFiles on disk
  2. Cache flush
    • The system periodically flushes the contents of the MemStore cache to a StoreFile on disk, clears the cache, and writes a mark into the HLog
    • Each flush generates a new StoreFile, so each Store contains multiple StoreFiles
    • Each Region server has its own HLog file. Every time the server starts, the HLog is checked to confirm whether any new write operations occurred after the most recent cache flush; if updates are found, they are first written to the MemStore and then flushed to StoreFiles, after which the old HLog file is deleted and the server begins serving users
  3. StoreFile merging
    • A new StoreFile is generated on every flush, and having too many of them slows down searches
    • Store.compact() is called to merge multiple StoreFiles into one
    • The merge operation is fairly resource-intensive, so it is started only after the number of StoreFiles reaches a certain threshold (a flush/compaction sketch follows this list)
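
HBase performs flushes and compactions on its own, but the Admin API also lets you trigger them manually, which is a simple way to observe the MemStore → StoreFile → merged StoreFile path described above. A sketch, with a hypothetical table name:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;

public class FlushAndCompact {
    // HBase flushes MemStores and compacts StoreFiles on its own schedule;
    // these Admin calls simply request the same operations on demand
    static void flushThenCompact(Admin admin, String name) throws IOException {
        TableName table = TableName.valueOf(name);  // e.g. the hypothetical "webtable"
        admin.flush(table);         // write the current MemStore contents out as new StoreFiles
        admin.majorCompact(table);  // request a major compaction that merges each Store's StoreFiles into one
    }
}
```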

Working principle of store

  • The Store is the core of a Region server
  • Multiple StoreFiles are merged into a single StoreFile
  • When a single StoreFile grows too large, a split operation is triggered and the parent Region splits into two child Regions

How HLog works

The HLog ensures that the system can be restored to the correct state in the event of a failure

[Chapter 4: HBase (datawhalechina.github.io)](https://datawhalechina.github.io/juicy-bigdata/#/ch4 HBase?id=_432-table and region)

experiment




Origin blog.csdn.net/qq_38869560/article/details/129070221