Distributed NoSQL column store database Hbase Java API (4)

Knowledge point 01: Course review

  1. Hbase Java API DML

    • DML implementation rules

      • step1: Build the connection, the general mechanism for the client to connect to the server

        Configuration: manages client configuration, e.g. the server address [for the Hbase Java API, the "server address" is the ZooKeeper address]
        Connection: the connection object
        
        • MySQL: build a MySQL connection: hostname:3306
        • Hive: build a Hive connection (HiveServer2): hostname:10000
      • step2: Build the table object for DML operations

        Table
        
        • For DDL operations, construct the administrator object instead: Admin (HBaseAdmin), obtained via connection.getAdmin()
    • put

      put		&lt;table&gt;		&lt;rowkey&gt;		&lt;column family:column&gt;		&lt;value&gt;
      
      • step1: Build a Put object and specify the rowkey

      • step2: Specify the column family, column and value of the put object

        addColumn(column family, column, value)
        
      • step3: Perform Put operation on the table

        table.put(Put)
        table.put(List<Put>)
        
    • delete

      delete		&lt;table&gt;		&lt;rowkey&gt;		&lt;column family:column&gt;
      
      • step1: construct the delete object, specify the rowkey

      • step2: Specify the deleted column or column family

        addColumn: delete the latest version of the specified column
        addColumns: delete all versions of the specified column
        addFamily: delete the entire specified column family
        
      • step3: Perform delete operation

        table.delete(Delete)
        
    • get: the fastest query method

      get		&lt;table&gt;		&lt;rowkey&gt;		[&lt;column family:column&gt;]
      
      • Queries by rowkey; the rowkey is Hbase's index, so every Get is an indexed lookup

      • step1: first construct the Get object and specify the rowkey

      • step2: Specify column family or column according to requirements

      • step3: Return all data of the entire Rowkey

        • Return value: Result: a Result represents all the data of a Rowkey

           20210101_001                  column=basic:age, timestamp=1616034013623, value=18                                   
           20210101_001                  column=basic:name, timestamp=1616034013623, value=laoda                               
           20210101_001                  column=other:addr, timestamp=1616034013623, value=shanghai                            
           20210101_001                  column=other:phone, timestamp=1616034013623, value=110 
          
        • Iteratively take out each column: Cell: a Cell represents a column of data

          20210101_001                  column=basic:age, timestamp=1616034013623, value=18 
          
        • A Result contains a Cell array: rs.rawCells()

        • Tools

          • Bytes: implements type conversion
            • Convert other types to bytes when writing: Bytes.toBytes(...)
            • Convert bytes back to other types when reading: Bytes.toInt(...), Bytes.toString(...), etc.
          • CellUtil: fetches data out of a Cell object, e.g. CellUtil.cloneValue(cell)
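The conversion that the Bytes utility performs can be illustrated with plain JDK code. This is a self-contained sketch of the same idea (big-endian encoding, UTF-8 strings), not Hbase's actual `org.apache.hadoop.hbase.util.Bytes` class:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch of Bytes-style conversions using only the JDK.
public class BytesSketch {
    // "write" direction: other type -> byte[]
    static byte[] toBytes(int v) {
        return ByteBuffer.allocate(4).putInt(v).array(); // big-endian, like Bytes.toBytes(int)
    }
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
    // "read" direction: byte[] -> other type
    static int toInt(byte[] b) {
        return ByteBuffer.wrap(b).getInt();
    }
    static String toString(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] age = toBytes(18);           // what you would store in a cell
        byte[] name = toBytes("laoda");
        System.out.println(toInt(age));     // 18
        System.out.println(toString(name)); // laoda
    }
}
```

Everything in Hbase is stored as bytes, which is why both the write path (`toBytes`) and the read path (`toInt`/`toString`) are needed.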
    • scan: full table scan, not commonly used

      scan  &lt;table&gt;
      
      • step1: Build the Scan object

      • step2: Table execution Scan object: getScanner

      • step3: Get the return value, multiple Rowkey data

        • ResultScanner: an Iterable of Result (one Result per rowkey)
    • Filter: Conditional query, the most commonly used

      scan  &lt;table&gt;  + Filter
      
      • step1: Build the Scan object
      • step2: Build filters based on query conditions
        • Range filtering
          • STARTROW: the rowkey to start from (inclusive)
          • STOPROW: the rowkey to stop at (exclusive)
        • Filter: Filter
          • Rowkey filter: PrefixFilter
          • Column value filter: SingleColumnValueFilter
          • Column filtering: MultipleColumnPrefixFilter
          • Combination filtering: FilterList
      • step3: Table execution Scan object: getScanner
      • step4: Get the return value, multiple Rowkey data
        • ResultScanner: an Iterable of Result (one Result per rowkey)
  2. Learning methods of operational knowledge points

    • Commands: usually English abbreviations; remember the keywords and the syntax, and memorize them

    • JavaAPI

      • Which classes: the role of each class and how to construct it

      • Which methods: for create, read, update, and delete

Knowledge point 02: course objectives

  1. Hbase storage design
    • The storage structure in the entire Hbase?
      • Hbase、Zookeeper、HDFS
    • The relationship between Table and RegionServer?
    • How is Table distributed? What are the rules for dividing regions? What are the rules for writing data distribution? 【important】
    • Storage in the Region? 【important】
    • The relationship between Hbase data and HDFS?
  2. Hbase’s most serious problem: hot spots
    • Phenomenon, cause
    • Solution【Important】
      • Pre-partitioning: a table has multiple Region partitions
      • Table design: Rowkey’s design

Knowledge Point 03: Storage Design: Storage Architecture

[image: image-20210317190105892.png]

  • Question: How does Hbase realize data storage as a whole?

  • Analysis

    • Client: Responsible for connecting to the server
      • Provide a development interface, submit the user's command or code to the server for execution
      • Return the result of the server execution to the user
    • Zookeeper: stores part of Hbase metadata
      • All Hbase clients need to connect to Zookeeper to obtain metadata
    • Hbase: Distributed memory
      • HMaster: Management functions
      • HRegionServer: Responsible for data storage, providing external client read and write
        • Distributed memory
    • HDFS: Distributed Disk
      • DataNode: Responsible for writing data in Hbase memory to disk

Knowledge point 04: Storage design: the relationship between Table, Region, and RegionServer

  • Question: The client is operating a table, and the data is ultimately stored in the RegionServer. What is the relationship between the table and the RegionServer?

  • Analysis

    • Table: It is a logical object that does not exist physically. It is a concept for users to implement logical operations and is stored in metadata.
      • Similar to files in HDFS
    • RegionServer: is a physical object, a process in Hbase that manages the storage of a machine
      • Similar to DataNode in HDFS
    • Region: The smallest unit of data storage in Hbase
      • Similar to Block in HDFS
      • It is the partition concept: each table can be divided into multiple Regions to realize distributed storage
        • By default, a table has only one partition
      • Each Region is managed by exactly one RegionServer; the Region is stored on that RegionServer
        • One RegionServer can manage multiple Regions
  • Observation and monitoring

    [image: image-20210319091350259.png]

    [image: image-20210319091417912.png]

    [image: image-20210319091609104.png]

Knowledge point 05: Storage design: Division rules of Region

[image: image-20210317191202582.png]
  • Question: A table is divided into multiple Regions. What are the rules for the division? Write a piece of data to the table, which Region will this piece of data be written to, and what are the allocation rules?

  • Analysis

    • Review: HDFS partition rules

      • The rules for dividing partitions: according to the size, the file is divided into a block every 128M
    • Hbase partitioning rules: range division [according to Rowkey range]

      • Range division: the whole -oo ~ +oo interval is divided into segments

        • Each partition has a range: a rowkey is written to the partition whose range contains it

          [startKey, stopKey)
          
          • Left-closed, right-open interval
      • Default: when a table is created, there is only one Region

        • Range: -oo ~ +oo
      • Custom: When creating a table, specify how many partitions there are and the range of each partition

        • Example: create a table with 2 Region partitions
          • region0 : -oo ~ 50
          • region1: 50 ~ +oo
        • Example: create a table with 4 Region partitions
          • region0 : -oo ~ 30
          • region1: 30 ~ 60
          • region2: 60 ~ 90
          • region3: 90 ~ +oo
    • Data distribution rules: write to which partition according to which range the Rowkey belongs to

      • Example: create a table with 4 Region partitions
        • region0 : -oo ~ 30
        • region1: 30 ~ 60
        • region2: 60 ~ 90
        • region3: 90 ~ +oo
      • Rowkey of the written data: comparison is lexicographic by ASCII code, not numeric
        • 10 => Region0
        • 1000 => Region0
        • 033 => Region0
        • 789999999 => Region2
        • 91 => Region3
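The assignment rules above can be sketched as plain code. This is an illustration of range-based assignment with the example split points, not Hbase's internal implementation (for ASCII rowkeys, Java's `String.compareTo` matches Hbase's byte-by-byte comparison):

```java
// Sketch of Hbase's range-based region assignment: rowkeys are compared
// lexicographically (byte by byte), never as numbers.
public class RegionAssign {
    // Split points from the example above:
    // region0 = (-oo, 30), region1 = [30, 60), region2 = [60, 90), region3 = [90, +oo)
    static final String[] SPLITS = {"30", "60", "90"};

    static int regionOf(String rowkey) {
        int region = 0;
        for (String split : SPLITS) {
            // [startKey, stopKey): the start key is inclusive
            if (rowkey.compareTo(split) >= 0) {
                region++;
            }
        }
        return region;
    }

    public static void main(String[] args) {
        System.out.println(regionOf("10"));        // 0
        System.out.println(regionOf("1000"));      // 0  ("1" < "3", despite 1000 > 30)
        System.out.println(regionOf("033"));       // 0
        System.out.println(regionOf("789999999")); // 2
        System.out.println(regionOf("91"));        // 3
    }
}
```

Note how `"1000"` lands in region0: the first character `'1'` already sorts before `'3'`, so the numeric magnitude is irrelevant.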
  • Observation and monitoring

    • Only 1 partition

      [image: image-20210319093313002.png]

    • The case of multiple partitions

      create 'itcast:t3','cf',SPLITS => ['20', '40', '60', '80']
      

      [image: image-20210319093457846.png]

      [image: image-20210319093602323.png]

      put 'itcast:t3','0300000','cf:name','laoda'   =>  -oo ~ 20
      put 'itcast:t3','7890000','cf:name','laoda'   =>  60 ~ 80
      

      [image: image-20210319093903377.png]

Knowledge point 06: Storage design: Region internal storage structure

[image: image-20210317191716413.png]

  • Question: How is the data stored inside the Region?

  • Analysis

    • Table【Logic】/ RegionServer【Physical】
      • Region: partition level; the rowkey determines which partition a row belongs to, i.e. which Region it is written to
        • Store: storage layer; the data inside each partition is divided by column family, one column family per Store
          • Each column family corresponds to one Store; data of different column families is stored in different Stores
          • If a table has 2 column families, its Region contains two Stores
          • Advantage: separates different data into different storage units
            • Suppose there are 100 columns. Without column families, all 100 columns are stored together, so querying one column takes up to 100 comparisons
            • With two column families of 50 columns each, querying a column in one family takes at most 51 comparisons
              • First compare once to find which Store to query, then at most 50 comparisons inside it
    • Storage in the store
      • One MemStore: a memory area in the Region; each Store is allocated a portion of it
        • Reads and writes go to the MemStore first
      • 0 or more StoreFile files: data files in the Store; when the MemStore reaches its threshold, the in-memory data is flushed to HDFS
        • A StoreFile logically belongs to a Store
          • Physically it is stored on HDFS, essentially as an HFile: an ordered binary file
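The MemStore-to-StoreFile flush described above is driven by configuration. As an illustration, the per-Region flush threshold can be set in hbase-site.xml (134217728 bytes = 128 MB is the usual default; the value here is only an example):

```xml
<!-- hbase-site.xml: flush the MemStore to an HDFS StoreFile once it reaches this size -->
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value> <!-- 128 MB -->
</property>
```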
  • Summary

    • RegionServer: Region is stored in Regionserver
      • Region: There are multiple Regions in a table, and determine which region to write according to Rowkey
        • There are multiple column families in a table, each column family corresponds to a Store, and there are multiple stores in a region
        • Store: Determine which Store to write according to the column family
          • memstore: memory area
          • storefile: file on HDFS

Knowledge Point 07: Storage Design: Storage Structure in HDFS

[image: image-20210317191754182.png]

  • Question: How is Hbase's data stored in HDFS?

  • Analysis

    • The storage directory of the entire Hbase in HDFS: determined by the configuration

      hbase.rootdir=hdfs://node1:8020/hbase
      

      [image: image-20210319103050484.png]

    • Namespace directory

      [image: image-20210319103216678.png]

    • Directory of the table

      [image: image-20210319103307682.png]

    • Directory of Region

      [image: image-20210319103437641.png]

    • Directory of the Store: divided by column family

      [image: image-20210319103636158.png]

    • Observe the data

      [image: image-20210319103835828.png]

        hbase> flush 'TABLENAME'
        hbase> flush 'REGIONNAME'
        hbase> flush 'ENCODED_REGIONNAME'
        hbase> flush 'REGION_SERVER_NAME'
      
      # Force the in-memory data to be flushed to HDFS as a StoreFile
      flush 'itcast:t3'
      

      [image: image-20210319104124012.png]

Knowledge Point 08: Hot Issues: Phenomenon and Causes

  • Phenomenon: during a certain period, a large number of read and write requests all concentrate on one Region, so that RegionServer's load becomes high while other Regions and RegionServers sit relatively idle
  • Problem: the probability of that RegionServer failing rises, overall performance drops, and efficiency suffers
  • Reason: The essential reason is that the data is not evenly distributed
    • Case 1: If this table has only one partition
      • region0 : -oo ~ + oo
      • All data is read and written to this partition
      • Hot spots will definitely appear
    • Case 2: the table has multiple partitions, but the rowkeys are written sequentially
      • region0 : -oo ~ 30
      • region1:30 ~ 70
      • region2:70 ~ +oo
      • Rowkey written
        • 00000001
        • 00000002
        • 00000003
        • ……
        • 29999999: still writing to region0
        • 30000000: writes start going to region1
        • ……
        • all writes now go to region1
        • ……
      • Reason: Regions are divided by ordered ranges and the rowkeys are also ordered, so consecutive rowkeys land in the same Region
  • Solution
    • Given multiple partitions when creating a table
    • Rowkey cannot be continuous when writing

Knowledge point 09: Distributed design: pre-partitioning

  • Requirements: When creating a table, specify a table to have multiple Region partitions

  • Analysis

    • Dividing multiple partitions enables distributed parallel reads and writes: the infinite interval is cut into several segments

    • Basis for the split points: they must match the rowkey or the rowkey prefix

    • Counter-example: suppose the partitions are split on digit values

      • region0 : -oo ~ 3

      • region1: 3 ~ 7

      • region2: 7 ~ +oo

      • But every Rowkey starts with a letter:

        a143434343
        ydfjdkfjd4
        ……
        
      • In ASCII every letter sorts after '7', so all rows land in region2: a hot spot occurs

  • Implementation

    • Method 1: Specify the partition segment to achieve pre-partitioning

      create 'ns1:t1', 'f1', SPLITS => ['10', '20', '30', '40']
      # Write the split points in a file, one per line
      create 't1', 'f1', SPLITS_FILE => 'splits.txt'
      
    • Method 2: Specify the number of Regions, and automatically perform Hash division: a combination of letters and numbers

      # The rowkey prefix may be letters or digits
      create 'itcast:t4', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
      

      [image: image-20210319111755041.png]

    • Method 3: Java API

      Admin admin = conn.getAdmin();
      admin.createTable(tableDescriptor, splitKeys);   // byte[][] splitKeys: the split points
      
  • Summary

    • Principle: the partitions must be designed according to the rowkey or rowkey prefix
    • Method: any of the three approaches works, as long as it meets the needs

Knowledge point 10: Hbase table design: Rowkey design

  • Questions

    • Rowkey functions
      • The unique identifier of a row
      • The only index in Hbase
      • Determines which Region a row is assigned to
    • Design requirements
      • The Rowkey must not repeat
      • Queries by Rowkey should be as fast as possible
      • Rowkeys should be hashed and unordered, not sequential
  • Requirements: According to business needs, to design rowkey reasonably to achieve high-performance data storage

  • Analysis: Rowkey design is different for tables with different business requirements

    • Design rule

    • Business principle : the Rowkey design must satisfy the business requirements; try to retrieve data by Rowkey so the index is used

      • The Rowkey is the only index in Hbase, and Rowkey queries are implemented by prefix matching
      • A query that does not go through the Rowkey is a full table scan
      • Use the most common query conditions as the Rowkey prefix
        • area
        • time
        • ID: order ID, user ID
    • Unique principle : Rowkey must uniquely identify a piece of data, and Rowkey cannot be repeated

    • Combination principle : Try to combine the most commonly used query conditions as Rowkey, and the most commonly used one as the prefix

      • Common conditions: time, order id, user id

      • Most commonly used: time

      • Rowkey

        1616124723000_user001_order001
        
        • The following queries all use the index
          • all order data at a certain time
          • all order data of a certain user at a certain time
          • a specific order of a certain user at a certain time
        • The following query does not use the index
          • all orders of a certain user
          • it can only do a full table scan: slow
          • this is later solved with a secondary index
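The reason a Rowkey prefix query can use the index is that a prefix defines one contiguous range [prefix, nextPrefix), so Hbase only scans that slice of the table. A simplified sketch of computing the exclusive stop row for a prefix (Hbase's real prefix handling also deals with 0xFF overflow, which this illustration omits):

```java
// Sketch: derive the exclusive stop row of a prefix scan by incrementing the
// last character of the prefix. [prefix, stopRow) then covers exactly the
// rowkeys that start with the prefix.
public class PrefixRange {
    static String stopRowFor(String prefix) {
        char[] c = prefix.toCharArray();
        c[c.length - 1]++; // simplified: real code must handle a trailing 0xFF byte
        return new String(c);
    }

    public static void main(String[] args) {
        String prefix = "1616124723000_user001";
        System.out.println(prefix);             // scan start (inclusive)
        System.out.println(stopRowFor(prefix)); // scan stop (exclusive): 1616124723000_user002
    }
}
```

A query by user id alone ("user001") cannot be turned into such a range, because matching rowkeys are scattered across the whole keyspace; that is exactly why it degrades to a full table scan.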
    • Hashing principle : Rowkeys must not be generated in order. Because Regions are divided by range, ordered rowkeys cause hot spots, so an unordered Rowkey must be constructed.

      • Question: Ordered rowkey, using time as the prefix, time is ordered

        1616124723000_user001_order001
        1616124723000_user002_order002
        1616124723000_user003_order003
        1616124723001_user001_order004
        1616124723002_user001_order005
        1616124723000_user001_order001
        ……
        
      • solve

        • Solution 1: do not use a sequential field as the prefix, e.g. use the user id as the prefix

          user001_1616124723000_order001
          user999_1616124723000_order002
          
          • A single user rarely places multiple orders within the same second
          • The user ids active in the same second are not consecutive
          • Disadvantage: the prefix is not the most commonly used query condition
        • Solution 2: Based on the orderly reversal, the time is reversed to construct a Rowkey

          0003274216161_user001_order001
          0003274216161_user002_order002
          0003274216161_user003_order003
          1003274216161_user001_order004
          1003274216161_user001_order005
          2003274216161_user001_order001
          3003274216161
          4003274216161
          ……
          
          • Disadvantages: each time you query, you must first reverse and then query
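Solution 2 above can be sketched in a few lines; reversing the timestamp makes two consecutive write times differ in the first character of the rowkey, so they scatter across regions:

```java
// Sketch of the reversal scheme: reverse the ordered timestamp string before
// using it as the rowkey prefix.
public class ReverseKey {
    static String buildRowkey(long timestamp, String userId, String orderId) {
        String reversed = new StringBuilder(Long.toString(timestamp)).reverse().toString();
        return reversed + "_" + userId + "_" + orderId;
    }

    public static void main(String[] args) {
        // Consecutive timestamps 1616124723000 and 1616124723001 now start with
        // '0' and '1' respectively, instead of sharing the same prefix.
        System.out.println(buildRowkey(1616124723000L, "user001", "order001")); // 0003274216161_user001_order001
        System.out.println(buildRowkey(1616124723001L, "user001", "order004")); // 1003274216161_user001_order004
    }
}
```

On the read side the same reversal must be applied to the query's timestamp first, which is the disadvantage noted above.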
        • Scheme 3: Coding, random construction

          1616124723000_user001_order001
          1616124723000_user002_order002
          1616124723000_user003_order003
          1616124723001_user001_order004
          1616124723002_user001_order005
          1616124723000_user001_order001
          ……
          
          |  Encrypting/encoding the original rowkey yields an encoded prefix
          
          12345678_1616124723000_user001_order001
          37483784_1616124723000_user002_order002
          ……
          
          • Disadvantage: every query must first compute the encoding
        • Option 4: salting — prepend a random value to the Rowkey

          1616124723000_user001_order001
          1616124723000_user002_order002
          1616124723000_user003_order003
          1616124723001_user001_order004
          1616124723002_user001_order005
          1616124723000_user001_order001
          ……
          
          |	the new rowkey prefix is a random value in [0-9]
          
          0_1616124723000_user001_order001
          4_1616124723000_user002_order002
          9_1616124723000_user003_order003
          0_1616124723001_user001_order004
          0_1616124723002_user001_order005
          1_1616124723000_user001_order001
          
          • Disadvantage: every query must try each salt prefix in turn, which reduces read performance
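A common variant of Option 4 derives the salt deterministically from the key instead of purely at random, so a point read can recompute the prefix instead of trying all of them. A sketch (the modulus 10 and the use of `String.hashCode` are illustrative assumptions, not a fixed Hbase rule):

```java
// Sketch of salting: prepend a bucket digit 0-9 derived from the original key,
// so sequential keys spread across regions pre-split on '1'..'9'.
public class SaltedKey {
    static String salt(String rowkey) {
        // Deterministic per key: reads of a known key can recompute the bucket.
        int bucket = Math.abs(rowkey.hashCode() % 10);
        return bucket + "_" + rowkey;
    }

    public static void main(String[] args) {
        System.out.println(salt("1616124723000_user001_order001"));
        System.out.println(salt("1616124723001_user001_order004"));
    }
}
```

Range scans still have to fan out over all 10 salt buckets, which is the read-performance cost noted above.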
    • Length principle : the Rowkey should not be too long

      • Principle: as short as possible while still meeting business needs
      • Problem: the longer the rowkey, the more space the index occupies and the slower rowkey comparisons become, hurting performance
        • The rowkey is stored redundantly at the storage layer (repeated with every cell)
      • Recommendation: no more than 100 bytes
        • If it would exceed 100 bytes, encode it
          • a 100-byte key => MD5 => 32 hex characters (or a 16-character substring)
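The MD5 shortening suggested by the length principle can be sketched with the JDK's MessageDigest (the example rowkey below is made up for illustration):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the length principle: shorten an over-long rowkey via MD5,
// yielding a fixed 32-hex-character key (or a 16-character substring of it).
public class Md5Key {
    static String md5Hex(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, d)); // zero-padded to 32 hex chars
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available in the JDK
        }
    }

    public static void main(String[] args) {
        String longRowkey = "some_very_long_business_key_well_over_one_hundred_bytes";
        String encoded = Md5Key.md5Hex(longRowkey);
        System.out.println(encoded.length());         // 32
        System.out.println(encoded.substring(0, 16)); // 16-character variant
    }
}
```

The trade-off is the same as Scheme 3 above: the original key is no longer readable from the rowkey, and queries must recompute the hash first.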
  • Summary

    • Business principle: make the prefix the most commonly used query field
    • Unique principle: each rowkey identifies exactly one row
    • Combination principle: combine the common query conditions into the Rowkey
    • Hashing principle: construct rowkeys so they are not sequential
    • Length principle: as short as possible while meeting business needs



Origin blog.csdn.net/xianyu120/article/details/115194565