Programming HBase with MapReduce, HBase tuning, and the Bloom filter

Manipulating HBase with MapReduce programming
=============================================

    1. TableInputFormat input K, V format
        ImmutableBytesWritable    // the row key; analogous to the offset in TextInputFormat
        Result                    // the actual row data


        Use conf to set the table configuration:
            conf.set(TableInputFormat.INPUT_TABLE, "ns1:t1");    // specify the table name
            // must be added manually
            conf.set("hbase.zookeeper.quorum", "s102:2181,s103:2181");    // specify the ZooKeeper connection address

    2. TableOutputFormat output K, V format
        ignored           // the key is ignored; NullWritable is recommended
        Put || Delete     // an HBase Put or Delete


        Use conf to set the table configuration:
            conf.set(TableOutputFormat.OUTPUT_TABLE, "ns1:wc");    // specify the table name
            // must be added manually
            conf.set("hbase.zookeeper.quorum", "s102:2181,s103:2181");    // specify the ZooKeeper connection address
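
    Putting the two items above together, a minimal driver sketch (assuming hypothetical MyMapper/MyReducer/Driver classes and the table names from the notes; TableMapReduceUtil wires in the input/output formats for us):

    ```java
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "s102:2181,s103:2181");
    Job job = Job.getInstance(conf, "hbase-wc");
    job.setJarByClass(Driver.class);

    Scan scan = new Scan();
    // mapper receives (ImmutableBytesWritable, Result) pairs from ns1:t1
    TableMapReduceUtil.initTableMapperJob("ns1:t1", scan,
            MyMapper.class, Text.class, IntWritable.class, job);
    // reducer emits Put objects into ns1:wc (the output key is ignored)
    TableMapReduceUtil.initTableReducerJob("ns1:wc", MyReducer.class, job);
    job.waitForCompletion(true);
    ```

    TableMapReduceUtil also sets TableInputFormat.INPUT_TABLE / TableOutputFormat.OUTPUT_TABLE internally, so the conf.set calls above are only needed when configuring the formats by hand.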




HFile and the Bloom filter
============================
    HFile is HBase's on-disk file format, stored as key/value pairs; both keys and values are byte arrays.

    An HFile consists of:
        storage space for reading/writing compressed blocks, with the compression codec for I/O
        specified per block
        temporary key storage
        temporary value storage
        the HFile index, held in memory, occupying roughly (56 + AvgKeySize) * NumBlocks bytes

    Performance tuning suggestions:
    ****    Minimum block size: recommended between 8KB and 1MB.
            Large blocks suit sequential reads and writes, but are inconvenient for random access (more data must be decompressed per read).
            Small blocks suit random reads and writes, but take more memory and are slower to create (many blocks, and each compressed block requires a flush).
            Because of the compression cache, the minimum block size should be between 20KB and 30KB.
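
    The index-memory formula above can be checked with simple arithmetic; here is a small sketch (block count and key size are made-up example values):

    ```java
    // Rough HFile index memory estimate from the note above:
    //   indexBytes ≈ (56 + AvgKeySize) * NumBlocks
    public class HFileIndexEstimate {
        static long estimateIndexBytes(long avgKeySize, long numBlocks) {
            return (56 + avgKeySize) * numBlocks;
        }

        public static void main(String[] args) {
            // e.g. 1 GiB of data in 64 KiB blocks => 16384 blocks
            long blocks = (1L << 30) / (64 * 1024);
            // with 40-byte keys: (56 + 40) * 16384 = 1572864 bytes (1.5 MiB of heap)
            System.out.println(estimateIndexBytes(40, blocks));
        }
    }
    ```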


    The HFile index is loaded into memory each time the region is opened.

    region:folder /cf:folder --------> HFile

    When querying, every HFile index under the cf folder is searched, LSM-tree style (roughly a binary search per file),
    so a lookup must consult the index of every HFile.

    To mitigate this, the Bloom filter can immediately determine that a file does NOT contain a given rowKey,
    which filters out files that need not be scanned at all.
    It is coarser-grained than the block index.

    So when HBase locates a rowKey, it first uses the Bloom filter to exclude HFiles that definitely do not contain it,
    then uses the block index of the remaining candidate HFiles to find the data.


    Bloom filter configuration: BLOOMFILTER
        NONE       // no Bloom filter, no overhead
        ROW        // recommended for row-level operations; uses fewer resources
        ROWCOL     // recommended for row+column-level operations; uses slightly more resources


    alter 'ns1:t1', NAME => 'f1', BLOOMFILTER => 'ROWCOL'
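
    The same change can be made from Java; a sketch against the HBase 2.x Admin API (conf setup as earlier; table and family names from the notes):

    ```java
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
        ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("f1"))
                .setBloomFilterType(BloomType.ROWCOL)   // NONE / ROW / ROWCOL
                .build();
        admin.modifyColumnFamily(TableName.valueOf("ns1:t1"), cf);
    }
    ```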



HBase tuning:
=================================
    1. Adjust the RegionServer JVM memory settings (this example sizes the permanent generation)
        export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"

    2. Configure to reduce memory fragmentation caused by garbage collection
        hbase.hregion.memstore.mslab.enabled ===> true    

    3. Use compression     // best set when the table is created; existing HFiles are only rewritten with the new codec during compaction
        alter 'ns1:t1', NAME => 'f1', COMPRESSION => 'LZ4'

    4. Optimize split and merge:
        split: avoid split storms
                         // A region defaults to 10G. If all regions of a table grow to the
                         // threshold at the same time, they all split at the same time,
                         // which badly hurts cluster performance.
                         // Remedy: pre-split the table.

               Avoid hot-spotting:     // rowKey design principle: scattered across the cluster, contiguous within a region
                                 1. composite keys
                                 2. adjust the weight (field order) of the composite key
                                 3. salting: random salt, hash salt, or a manually designed prefix
                                    (scattered across the cluster, contiguous within a region)
                                 4. for numbers, invert with MAX_VALUE - num so the largest sorts first
                                 5. format number strings to fixed width with DecimalFormat

               To move a region manually: move 'ENCODED_REGIONNAME', 'SERVER_NAME'
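
    The rowKey salting, inversion, and number-formatting techniques above can be sketched in plain Java (bucket count and key names are made-up example values):

    ```java
    import java.text.DecimalFormat;

    public class RowKeyDesign {
        // hash salting: the prefix is derived from the key itself,
        // so reads can recompute it (unlike a random salt)
        static String saltedKey(String key, int buckets) {
            int salt = (key.hashCode() & Integer.MAX_VALUE) % buckets;
            return String.format("%02d_%s", salt, key);
        }

        // numeric inversion: MAX_VALUE - num, so the newest (largest) value sorts first
        static long invert(long num) {
            return Long.MAX_VALUE - num;
        }

        // fixed-width formatting so numbers sort correctly as byte strings
        static String fixedWidth(long num) {
            return new DecimalFormat("0000000000").format(num);
        }
    }
    ```

    Without fixed-width formatting, "10" sorts before "9" lexicographically; with it, "0000000009" < "0000000010" as intended.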



                    
        merge: merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME'
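
    The pre-splitting remedy mentioned under split can be sketched from Java as well (HBase 2.x Admin API; the table name and split points are made-up examples):

    ```java
    // create the table already split at chosen rowKey boundaries,
    // so initial load is spread across regionservers from the start
    byte[][] splits = new byte[][] {
        Bytes.toBytes("05"), Bytes.toBytes("10"), Bytes.toBytes("15")
    };
    TableDescriptor td = TableDescriptorBuilder
            .newBuilder(TableName.valueOf("ns1:t2"))
            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("f1"))
            .build();
    admin.createTable(td, splits);   // 4 regions: (,05) [05,10) [10,15) [15,)
    ```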

    5. Load balancing:     // if one regionserver hosts more than half the regions, it is under excessive pressure

        close_region     // close a region
        assign           // assign a region
        unassign         // unassign a region from its current regionserver so it is reassigned on another node

        balancer         // run the balancer: unassigns regions and reassigns them across the cluster


    6. API
        1. Turn off automatic flushing (batch client writes)
            table.setAutoFlush(false, false);

        2. Set the scan cache || batch
            scan.setCaching()     // rows fetched per RPC
            scan.setBatch()       // columns returned per Result

        3. Limit the scan range     // avoid full-table scans

        4. Close the ResultScanner     // leaking scanners hurts performance
            rs.close()

        5. Scan block caching     // default true
            scan.setCacheBlocks()     // whether blocks read by this scan are kept in the server-side block cache for reuse

        6. Set the filter:        
            RowFilter
            FamilyFilter
            QualifierFilter
            ValueFilter

            SingleColumnValueFilter

            FilterList

        7. Disable WAL writes for a put     // not recommended: skipping the WAL risks data loss on regionserver failure
            put.setDurability(Durability.SKIP_WAL);
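
    A sketch combining the client-side tips above against the HBase 2.x API (table name, row range, and filter values are made-up; BufferedMutator is the modern replacement for setAutoFlush(false)):

    ```java
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("ns1:t1"))) {
        Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("row100"))    // 3) limit the scan range
                .withStopRow(Bytes.toBytes("row200"))
                .setCaching(500)                          // 2) rows fetched per RPC
                .setCacheBlocks(true)                     // 5) keep read blocks in the block cache
                .setFilter(new RowFilter(CompareOperator.EQUAL,
                        new SubstringComparator("1")));   // 6) server-side filter
        try (ResultScanner rs = table.getScanner(scan)) { // 4) always close the scanner
            for (Result r : rs) {
                // process r
            }
        }

        // 1) batched writes via BufferedMutator instead of setAutoFlush(false)
        try (BufferedMutator mutator =
                 conn.getBufferedMutator(TableName.valueOf("ns1:t1"))) {
            Put put = new Put(Bytes.toBytes("row1"))
                    .addColumn(Bytes.toBytes("f1"), Bytes.toBytes("c1"),
                               Bytes.toBytes("v1"));
            // 7) skip the WAL only if losing this data on a crash is acceptable
            put.setDurability(Durability.SKIP_WAL);
            mutator.mutate(put);
        }
    }
    ```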

    7. Configuration
         1. Increase the processing thread
            hbase.regionserver.handler.count

        2. Increase heap memory: hbase-env.sh
            export HBASE_HEAPSIZE=1G     // default 1G
            
        3. Adjust the cache size
            hfile.block.cache.size = 0.4     // 40% of heap memory

        4. Adjust the size of the memstore
            hbase.regionserver.global.memstore.size             // maximum memstore, default heap memory 40% 
            hbase.regionserver.global.memstore.size.lower.limit     // minimum memstore, default maximum memstore 95%
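
        Settings 1, 3, and 4 live in hbase-site.xml (heap size stays in hbase-env.sh); a sketch with the defaults noted above (the handler-count default of 30 and the example value are assumptions):

        ```xml
        <property>
          <name>hbase.regionserver.handler.count</name>
          <value>60</value> <!-- default 30; raise for many concurrent clients -->
        </property>
        <property>
          <name>hfile.block.cache.size</name>
          <value>0.4</value> <!-- read block cache: 40% of heap -->
        </property>
        <property>
          <name>hbase.regionserver.global.memstore.size</name>
          <value>0.4</value> <!-- total memstore limit: 40% of heap -->
        </property>
        <property>
          <name>hbase.regionserver.global.memstore.size.lower.limit</name>
          <value>0.95</value> <!-- flush target: 95% of the memstore limit -->
        </property>
        ```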
