Hbase--storage structure

1. Storage concept features

Features

  • 优先读写内存
    • Write: write directly to memory
    • Read: read the memory first, if not, then read HDFS
  • All data in the entire Hbase are arranged in the order of Rowkey
  • When querying data, you canrowkey前缀匹配查询
  • Rowkey是整个Hbase的唯一索引
  • Data type: The underlying storage types of Hbase are all bytes
  • data structure:KV结构,每一列在底层存储都是一个KV
    • Key:Rowkey+列族名称+列的名称+时间戳
      • 20200101_001 column=basic:age, timestamp=1593247238403
    • Value:值
      • value=18

concept

RowKey:类似于主键的概念

  • Uniquely mark a line
  • 基于rowkey排序
  • 唯一索引

RegionServer:HBASE架构中的从节点

  • All Hbase data is written to the RegionServer, and data is also read from the RegionServer
    • Write: written into the memory of Regionserver
    • Read: read the memory of RegionServer first
  • There will be multiple slave nodes
    • 每一台RegionServer上都可以有很多个Region
    • Similar to DataNode
    • RegionServer是将数据存储在内存中
    • RegionServer will call HDFS client to write data to HDFS
    • DataNode是将数据存储在磁盘上

Region:RegionServer中的存储单元

  • Just数据表的分区
    Insert picture description here
  • 一个Region必然会被某一个RegionServer所管理
  • 一个RegionServer可以管理多个Region
    Insert picture description here

Thinking: I have a table, I have a lot of data, and I want to write to this table, how can I make this table distributed?

  • Only with this table 分布式, can I read and write data in parallel in a distributed manner?
  • Let a table have multiple partitions, and different partitions are stored on different machines
    • region:分区
    • regionserver:region所在的机器
  • Write data to the table, according to certain rules, write to a certain partition
    • Suppose there is a table with three partitions
      • region0:node-01
      • region1:node-02
      • region2:node-03
      • The first data: write to the first partition region0
      • The second piece of data: write to the second partition region1
      • The third piece of data: write to the third partition region2
      • ……
      • Implement distributed storage

In Hbase:

  • Each table has only one partition at the beginning by default
  • 每个Region都有一个范围
  • If a table has only one region: -oo ~ +oo
    Insert picture description here
  • If a table has two regions
    • region0 : -oo ~ 1000
    • region1:1000 ~ +oo
    • ……
  • If a table has five partitions
    • region0 : -oo ~ 10
    • region1:10 ~ 20
    • region2:20 ~ 30
    • region3:30 ~ 40
    • region4:40 ~ +oo
      Insert picture description here
  • 分区规则:根据rowkey所属的范围
    • analogy
concept HDFS Hbase
data file table
Split Block: Block 分区:Region
rule 128M a Block 范围
machine DataNode RegionoServer
Block has a copy 分区没有副本
  • Demo: a table has multiple partitions
create 'itcast01:t1', 'info', SPLITS => ['10', '20', '30', '40']

Insert picture description here

  • Partition rules: partition according to rowkey
    • 1111:region1
      • 10 < 1111 < 20
    • 5:region4
      • 40 < 5 < +oo
  • The relationship between Region and table
    • Region就是表的分区, A table has only one partition by default, and there can be multiple partitions
    • When writing data to the table, it will be written to different partitions of this table according to the partitioning rules
      Insert picture description here
  • The relationship between Region and RegionServer
    • Region is stored in RegionServer and managed by RegionServer
    • Region是RegionServer负载均衡的最小单元
      • regionserver1:10个region
      • regionserver2: 5 regions
      • regionserver3: 3 regions
      • |
      • |Load Balancing: Balanced Distribution of Regions
      • |
      • regionserver1:6个region
      • regionserver2: 6 regions
      • regionserver3: 6 regions
        Insert picture description here

Store: Smaller storage unit in Region

  • region是表在rowkey的方向上进行了划分多个分区
    • Rowkey : -oo ~ + oo
  • Store是表中列的划分,store就是列族
  • The data corresponding to each Rowkey in this table has a column family, and there is a Store in the corresponding region
  • The data of different column families are stored in different Stores, and the data of the same column families are stored together
    • Why design the concept of column family?
      • Just为了将经常一起读写的数据存在一起(相似io属性)
      • 加快读写的 效率

memstore: The memory area where data is stored in the Store

  • 每个store中都有一个memstore,用于存储写入的数据
  • 不同列族就是store的数据写入不同的内存
  • The memory of the regionserver where this region is used is used

StoreFile: The file refreshed from the memory in each Store is storefile

  • 一个Store中逻辑上可能会有多个storefile文件
  • Storefile is stored on HDFS, encapsulation of HFILE
    Insert picture description here

2. Logical structure

  • Put statement
put 'ns:tbname','rowkey',cf:col,value
  • RegionServer: slave node, used to store all data in Hbase
    • Table: All data exists in the form of tables, and reads and writes are specified tables
      • Region: to realize the association between the table and the RegionServer, the partition of the table, used to rowkeyturn the table into a distributed
        • Store: Each partition 列族is divided into stores according to internal partitions, smaller partitions
          • MemStore: The memory area of ​​Regionserver, each Store has 1 MemStore
          • StoreFIle: multiple, flashed data in the memory
            • Logically belongs to this store
            • Really stored on HDFS
              • 分区没有副本,如果分区所在的宕机了怎么办?
              • Hbase will restore this region on other regionservers
              • How can the data be guaranteed not to be lost?
                • StoreFile可以通过HDFS副本进行恢复
                • How to recover the data in memStore?
                  Insert picture description here
  • Monitoring logical structure
  • table
    Insert picture description here
  • Partition: Region
    Insert picture description here

3. Physical structure

  • The machine where the regionServer is located
    • 内存
    • can not see
  • The machine where the DataNode is located
    • Disk
  • Structure of storage on HDFS
    • NameSpace: corresponds to a directory in HDFS
      Insert picture description here
  • Table: Corresponding to a directory on HDFS
    Insert picture description here
  • Region: corresponds to a directory on HDFS
    Insert picture description here
  • Store: corresponds to a directory on HDFS, that is, column family
    Insert picture description here
  • StoreFile: File on HDFS
    Insert picture description here

4. Storage architecture

Insert picture description here

  • Step 1: The client finds the corresponding region based on the table name and rowkey
    • How does it know that this table has some regions?
    • How does it know the extent of each region?
    • How does it know which regionserver this region is on?
  • The second step:客户端会请求这个region所在的Regionserver,提交写请求
  • third step:RegionServer根据我们的请求来写入Region,根据列族来判断写入哪个Store
  • the fourth step:将这个数据读写Store中的MemStore
    • If it is read:先读memstore ,如果没有读StoreFIle
    • If writing: write directly to MemStore
      • When the data in MemStore reaches certain conditions, Flush will be triggered automatically
        • Flash the data in memStore into Storefile file and write it to HDFS

Guess you like

Origin blog.csdn.net/qq_46893497/article/details/114190926