Manipulating HBase using MR programming
=======================================
    1. TableInputFormat
        Input K, V format:
            ImmutableBytesWritable    // the row key; plays the role that the offset plays in TextInputFormat
            Result                    // the actual row data

        Use conf to set the table configuration:
            conf.set(TableInputFormat.INPUT_TABLE, "ns1:t1");             // Specify the table name
            // The ZooKeeper quorum must be added manually
            conf.set("hbase.zookeeper.quorum", "s102:2181,s103:2181");    // Specify the zk connection address

    2. TableOutputFormat
        Output K, V format:
            ignored          // the key is ignored when a value is specified; NullWritable is recommended
            Put || Delete    // an HBase Put or Delete

        Use conf to set the table configuration:
            conf.set(TableOutputFormat.OUTPUT_TABLE, "ns1:wc");           // Specify the table name
            // The ZooKeeper quorum must be added manually
            conf.set("hbase.zookeeper.quorum", "s102:2181,s103:2181");    // Specify the zk connection address

Bloom filter
============================
    HFile is the file format of HBase. Data is stored in k/v form, and both k and v are byte arrays.

    The memory footprint of an HFile includes the following:
        A constant overhead for reading or writing a compressed block
        One compression codec per compressed block for its I/O operations
        Temporary space to buffer the key
        Temporary space to buffer the value
        The HFile index, which lives in memory and takes up about (56 + AvgKeySize) * NumBlocks

    Performance optimization suggestions ****
        Minimum block size: recommended between 8KB and 1MB
        Large blocks are recommended for sequential reads and writes, but they are inefficient for random access (because more data needs to be decompressed)
        Small blocks are convenient for random reads and writes, but the block index takes up more memory and files are slower to create (because there are many blocks, and the compressor must be flushed at the end of each one)
        Because of the compression codec's internal caching, the smallest practical block size is around 20KB - 30KB
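    To make the index estimate (56 + AvgKeySize) * NumBlocks concrete, here is a minimal Java sketch that evaluates it for a few block sizes. It is an illustration only: the class and method names are hypothetical, the 1 GiB file size and 40-byte average key are made-up inputs, and it assumes the formula yields bytes and that the block count is simply file size divided by block size (ignoring compression).

```java
// Hypothetical helper: evaluates the rule of thumb (56 + AvgKeySize) * NumBlocks.
class HFileIndexEstimate {
    // Fixed per-block overhead taken from the formula in the notes.
    static final long PER_BLOCK_OVERHEAD = 56;

    // Number of blocks needed to hold fileSizeBytes, rounded up.
    // Assumption: blocks are filled to exactly blockSizeBytes.
    static long numBlocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Estimated in-memory index size (assumed to be in bytes).
    static long estimateIndexBytes(long avgKeySizeBytes, long numBlocks) {
        return (PER_BLOCK_OVERHEAD + avgKeySizeBytes) * numBlocks;
    }

    public static void main(String[] args) {
        long fileSize = 1L << 30; // made-up example: one 1 GiB HFile
        long avgKeySize = 40;     // made-up example: 40-byte keys
        long[] blockSizes = {8L * 1024, 64L * 1024, 1024L * 1024};
        for (long blockSize : blockSizes) {
            long blocks = numBlocks(fileSize, blockSize);
            long indexBytes = estimateIndexBytes(avgKeySize, blocks);
            System.out.printf("block=%dKB blocks=%d index~%d bytes%n",
                    blockSize / 1024, blocks, indexBytes);
        }
    }
}
```

    Under these assumptions the 8KB setting needs 131072 index entries (about 12 MB) while the 1MB setting needs 1024 (about 96 KB), which is the memory side of the small-block versus large-block trade-off described above.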
    The index in the HFile is loaded into memory each time the region is loaded.
        Directory layout:    region (folder) / cf (folder) --------> HFile

    When querying, the indexes of all HFiles in the cf folder are searched by LSM tree traversal (approximately a binary search),
    so a lookup has to go through every index.

    To solve this problem, the Bloom filter can immediately determine that a file does not contain the specified rowKey.
    It helps filter out files that do not need to be scanned.
    It works at a coarser granularity than the block index.

    Therefore, when HBase locates a rowKey, it first uses the Bloom filter to exclude the HFiles that definitely do not contain it,
    then searches the remaining HFiles, which may contain it, through their block indexes.

    Bloom filter configuration:
        BLOOMFILTER
            NONE      // Not used; takes up no space
            ROW       // Recommended for row-level lookups; takes up fewer resources
            ROWCOL    // Recommended for row+column-level lookups; takes up slightly more resources

        alter 'ns1:t1', NAME => 'f1', BLOOMFILTER => 'ROWCOL'

HBase tuning:
=================================
    1. Adjust the new generation heap memory size
        export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"
        // Note: -XX:PermSize and -XX:MaxPermSize size the permanent generation and are ignored on Java 8 and later;
        // the new generation itself is sized with -Xmn

    2. Reduce memory fragmentation caused by garbage collection
        hbase.hregion.memstore.mslab.enabled ===> true

    3. Use compression    // best set while the table is still empty; existing HFiles keep their old encoding until they are compacted
        alter 'ns1:t1', NAME => 'f1', COMPRESSION => 'LZ4'

    4. Optimize split and merge:
        split:
            Avoid split storms    // A region splits at 10G by default. When regions grow to the threshold at the same time,
                                  // they all split at the same time,
                                  // which greatly affects cluster performance.
            Pre-split
            Avoid hot data:       // rowKey design principle: scattered across the cluster, contiguous within a region
                1. Use a composite key
                2. Adjust the weight (order) of the fields in the composite key
                3. Salting:
                    random salt    x (not recommended)
                    hash salt
                    manually designed prefix: scattered across the cluster, contiguous within a region
                4. To compare numbers in reverse order, invert them with MAX_VALUE - num
                5. Format number strings with DecimalFormat
            To move a region manually:
        merge:
            merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME'

    5. Load balancing:    // When more than half of the regions are served by one regionServer, that server is under excessive pressure
        close_region      // Close the region
        assign            // Assign (register) the region
        unassign          // Unassign the region from the current regionserver and reassign it on another node
        balancer          // Run the balancer, which unassigns regions and reassigns them

    6. API
        1. Turn off auto flush
            table.setAutoFlush(false, false);
        2. Set scan caching || batch
            scan.setCaching()
        3. Limit the scan range         // avoid full table scans
        4. Close the ResultScanner      // leaving it open hurts performance
            rs.close()
        5. Set the scan block cache     // default true: blocks read by the scan are kept in the region server's block cache for later reads
        6. Set filters:
            RowFilter
            FamilyFilter
            QualifierFilter
            ValueFilter
            SingleColumnValueFilter
            FilterList
        7. Make the put skip the WAL    // not recommended: edits that have not been flushed are lost if the region server fails
            put.setDurability(Durability.SKIP_WAL);

    7. Configuration
        1. Increase the number of handler threads
            hbase.regionserver.handler.count
        2. Increase heap memory: hbase-env.sh
            export HBASE_HEAPSIZE=1G    // default 1G
        3. Adjust the block cache size
            hfile.block.cache.size = 0.4    // 40% of heap memory
        4. Adjust the size of the memstore
            hbase.regionserver.global.memstore.size                // maximum memstore, default 40% of heap memory
            hbase.regionserver.global.memstore.size.lower.limit    // minimum memstore, default 95% of the maximum memstore
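
    The rowKey design points listed under "Avoid hot data" (hash salt prefix, MAX_VALUE - num inversion, DecimalFormat padding) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class and method names, the comma separator, the two-digit salt, and the choice of a user id plus timestamp as the composite key are all hypothetical and not taken from the notes. It builds the key as a String; writing it to HBase would still require converting it to bytes.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical sketch of the rowKey design points; not an HBase API.
class RowKeySketch {
    // Locale.ROOT keeps the digits ASCII regardless of the JVM's default locale.
    private static final DecimalFormatSymbols SYMBOLS = DecimalFormatSymbols.getInstance(Locale.ROOT);

    // Hash salt: the same key always maps to the same bucket, so reads can
    // recompute the prefix (unlike a random salt). Assumes buckets <= 100.
    static String saltPrefix(String key, int buckets) {
        int bucket = Math.floorMod(key.hashCode(), buckets);
        return new DecimalFormat("00", SYMBOLS).format(bucket);
    }

    // Inversion: larger (newer) values become smaller, so they sort first.
    static long invert(long num) {
        return Long.MAX_VALUE - num;
    }

    // Fixed-width formatting (19 digits, the width of Long.MAX_VALUE) so that
    // string order matches numeric order for non-negative values.
    static String pad(long num) {
        return new DecimalFormat("0000000000000000000", SYMBOLS).format(num);
    }

    // Composite key: salt, then the field queried most often, then inverted time.
    static String rowKey(String userId, long timestampMillis, int buckets) {
        return saltPrefix(userId, buckets) + "," + userId + "," + pad(invert(timestampMillis));
    }

    public static void main(String[] args) {
        // Made-up inputs for illustration.
        System.out.println(rowKey("user42", 1000L, 10));
        System.out.println(rowKey("user42", 2000L, 10));
    }
}
```

    With this layout all rows of one user share a salt prefix and stay contiguous, while different users spread over the buckets; within a user, the row with the newer timestamp sorts first.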