1. Shell query
Querying HBase is simple: it offers only two methods, get and scan, and there is no notion of a multi-table join. For complex queries, you create corresponding external tables in Hive and write SQL statements, which Hive automatically turns into MapReduce jobs for execution.
This simplicity has a cost, though: getting the result you want is sometimes not so easy, and the approach is quite different from querying with SQL.
HBase provides many filters that act on row keys, columns, and values. Comparisons can be substring, binary, binary-prefix, regular-expression, and so on, and conditions can be combined with AND, OR, and similar operators. With filtering, it is therefore still possible to meet most needs and find the correct results.
1.1 Filter Types
Filters are described in the Chinese translation of the latest official HBase documentation (http://abloz.com/hbase/book.html). They fall into five types:
- Structural filters: filters that contain a set of other filters. Includes: FilterList
- Column value filters: filter on the value of each column, roughly equivalent to = and LIKE in a SQL query. Includes:
  - SingleColumnValueFilter
  - Comparators:
    - RegexStringComparator: supports regular-expression comparison of values.
    - SubstringComparator: checks whether a substring exists in a value. The comparison is case-insensitive.
    - BinaryPrefixComparator: binary prefix comparison.
    - BinaryComparator: binary comparison.
- Key-value metadata filters: filter on the column (family and qualifier) rather than the value. Includes:
  - FamilyFilter: filters on column family. In general, selecting column families on the Scan is better than doing it with a filter.
  - QualifierFilter: filters on column name (i.e. qualifier).
  - ColumnPrefixFilter: filters on a column name (i.e. qualifier) prefix.
  - MultipleColumnPrefixFilter: behaves like ColumnPrefixFilter, but multiple prefixes can be specified.
  - ColumnRangeFilter: enables efficient intra-row scanning.
- Row key filters: filter on row keys. It is generally better to use the startRow/stopRow methods of Scan for row selection, but RowFilter can also be used.
- Utility filters: for example, FirstKeyOnlyFilter, which is mainly used to count rows.
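To make the comparator semantics a little more concrete, here is a rough plain-Ruby model of what each comparator treats as a match under the = operator. It is only an illustrative sketch, not HBase code: the `matches?` helper and its comparator symbols are hypothetical names, and the real comparators operate on byte arrays on the region server.

```ruby
# Illustrative model only: mimics how the four comparators decide "equal".
# `matches?` and the comparator symbols are hypothetical, not HBase API.
def matches?(comparator, operand, value)
  case comparator
  when :binary        # BinaryComparator: the whole value must be byte-identical
    value.b == operand.b
  when :binaryprefix  # BinaryPrefixComparator: the value must start with these bytes
    value.b.start_with?(operand.b)
  when :substring     # SubstringComparator: case-insensitive containment
    value.downcase.include?(operand.downcase)
  when :regexstring   # RegexStringComparator: the pattern is searched for in the value
    Regexp.new(operand).match?(value)
  else
    raise ArgumentError, "unknown comparator: #{comparator}"
  end
end

puts matches?(:binary, 'zwy99', 'zwy99')       # true
puts matches?(:binaryprefix, 'zwy', 'zwy99')   # true
puts matches?(:substring, 'WY9', 'zwy99')      # true
puts matches?(:regexstring, '.*99', 'zwy99')   # true
puts matches?(:binary, 'zwy', 'zwy99')         # false
```

In the shell's filter strings these correspond to the `binary:`, `binaryprefix:`, `substring:` and `regexstring:` prefixes on the comparison value.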
2. Examples
1. FirstKeyOnlyFilter, a convenient filter for calculating the number of rows
hbase(main):002:0> scan 'toplist_ware_ios_1009_201231',{COLUMNS=>'info',FILTER=>"(FirstKeyOnlyFilter())"}
0000000001 column=info:loginid, timestamp=1343625459713, value=jjm168131013
0000000002 column=info:loginid, timestamp=1343625459713, value=loveswh
...
21 row(s) in 0.5480 seconds
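The reason this works for counting is that the filter returns only the first cell of each row, so every row still shows up exactly once in the result. The plain-Ruby sketch below models that behaviour on an in-memory list of cells; `first_key_only` and the `CELLS` list are hypothetical stand-ins for illustration, not HBase API.

```ruby
# Illustrative model only: cells as [row, column, value] triples in scan order.
CELLS = [
  ['0000000001', 'info:loginid', 'jjm168131013'],
  ['0000000001', 'info:userid',  '168131013'],
  ['0000000002', 'info:loginid', 'loveswh'],
  ['0000000002', 'info:userid',  '100898152']
].freeze

# Keep only the first cell seen for each row, as FirstKeyOnlyFilter does.
def first_key_only(cells)
  cells.uniq { |row, _column, _value| row }
end

first_cells = first_key_only(CELLS)
first_cells.each { |row, column, value| puts "#{row} column=#{column}, value=#{value}" }
puts "#{first_cells.size} row(s)"
```

The number of cells that survive equals the number of rows, which is why the row count reported by the scan can be read as a row total.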
2. Filter by column name substring
hbase(main):006:0> scan 'toplist_ware_ios_1009_201231',{COLUMNS=>['info:'],FILTER=>"(QualifierFilter(=,'substring:id'))"}
ROW COLUMN+CELL
0000000001 column=info:loginid, timestamp=1343625459713, value=jjm168131013
0000000001 column=info:userid, timestamp=1343625459713, value=168131013
0000000002 column=info:loginid, timestamp=1343625459713, value=loveswh
0000000002 column=info:userid, timestamp=1343625459713, value=100898152

hbase(main):005:0> scan 'toplist_ware_ios_1009_201231',{COLUMNS=>['info:loginid'],FILTER=>"(QualifierFilter(=,'substring:id'))"}
ROW COLUMN+CELL
0000000001 column=info:loginid, timestamp=1343625459713, value=jjm168131013
0000000002 column=info:loginid, timestamp=1343625459713, value=loveswh

hbase(main):007:0> scan 'toplist_ware_ios_1009_201231',{COLUMNS=>['info:'],FILTER=>"(QualifierFilter(=,'substring:nid'))"}
ROW COLUMN+CELL
0000000001 column=info:loginid, timestamp=1343625459713, value=jjm168131013
0000000002 column=info:loginid, timestamp=1343625459713, value=loveswh

hbase(main):008:0> scan 'toplist_ware_ios_1009_201231',{COLUMNS=>['info:'],FILTER=>"(QualifierFilter(=,'substring:nick'))"}
ROW COLUMN+CELL
0000000001 column=info:nick, timestamp=1343625459713, value=\xE5\xAE\xB6\xE6\x9C\x89\xE8\x99\x8E\xE5\xAE\x9D
0000000002 column=info:nick, timestamp=1343625459713, value=loveswh08
3. Value filtering
3.1 Regular-expression filtering

hbase(main):004:0> scan 'toplist_ware_ios_1009_201231',{COLUMNS=>'info',FILTER=>"(SingleColumnValueFilter('info','nick',=,'regexstring:.*99',true,true))"}
ROW COLUMN+CELL
0000000009 column=info:loginid, timestamp=1343625459713, value=zgh1968
0000000009 column=info:nick, timestamp=1343625459713, value=zwy99
0000000009 column=info:score, timestamp=1343625459713, value=5
0000000009 column=info:userid, timestamp=1343625459713, value=100366262
1 row(s) in 0.2520 seconds

3.2 Substring filtering (the classes must be imported first)

import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.util.Bytes

hbase(main):028:0> scan 'toplist_ware_ios_1001_201231',{COLUMNS =>'info:nick', FILTER=>SingleColumnValueFilter.new(Bytes.toBytes('info'),Bytes.toBytes('nick'),CompareFilter::CompareOp.valueOf('EQUAL'),SubstringComparator.new('8888'))}
ROW COLUMN+CELL
0000000002 column=info:nick, timestamp=1343625446556, value=\xE7\x81\x8F????\xE3\x81\x8A??8888
1 row(s) in 0.0330 seconds

3.3 Binary filtering
Substring comparison and similar comparators do not support multi-byte literals, so use a binary comparison instead.

hbase(main):010:0> scan 'toplist_ware_ios_1009_201231',{COLUMNS=>['info:'],FILTER=>"(QualifierFilter(=,'substring:nick') AND ValueFilter(=,'binary:7789\xE6\xB4\x81') )"}
ROW COLUMN+CELL
0000000016 column=info:nick, timestamp=1343625459713, value=7789\xE6\xB4\x81
1 row(s) in 0.1710 seconds
4. Combining a column-name substring filter with a binary value comparison
hbase(main):012:0> scan 'toplist_ware_ios_1009_201231',{COLUMNS=>['info:'],FILTER=>"(QualifierFilter(=,'substring:nick') AND ValueFilter(=,'binary:7789\xE6\xB4\x81') )"}
ROW COLUMN+CELL
0000000016 column=info:nick, timestamp=1343625459713, value=7789\xE6\xB4\x81
1 row(s) in 0.0120 seconds
hbase(main):014:0> scan 'toplist_ware_ios_1009_201231',{COLUMNS=>"info:",FILTER=>"(PrefixFilter('000000002')) AND (QualifierFilter(=,'substring:nick'))"}
ROW COLUMN+CELL
0000000020 column=info:nick, timestamp=1343625459713, value=Denny_feng
0000000021 column=info:nick, timestamp=1343625459713, value=\xE5\xB0\x8F\xE7\xBD\x97\xE6\x95\x99\xE7\xBB\x831
2 row(s) in 0.0440 seconds
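Because the hbase shell is a JRuby environment, combined filter strings like these can also be assembled with small Ruby helpers rather than typed by hand, which helps keep the parentheses balanced. The following is a minimal sketch: the helper names `filter_call`, `all_of` and `any_of` are hypothetical, and only the resulting string follows the filter syntax used in the examples.

```ruby
# Hypothetical helpers that only build filter-language strings.

# Render one filter call. String arguments are wrapped in single quotes;
# everything else (operator symbols, booleans) is passed through unquoted.
# Note: quotes inside an argument are not escaped by this sketch.
def filter_call(name, *args)
  rendered = args.map { |arg| arg.is_a?(String) ? "'#{arg}'" : arg.to_s }
  "#{name}(#{rendered.join(',')})"
end

# Join filters so that every one of them must match.
def all_of(*filters)
  "(#{filters.join(' AND ')})"
end

# Join filters so that at least one of them must match.
def any_of(*filters)
  "(#{filters.join(' OR ')})"
end

combined = all_of(
  filter_call('PrefixFilter', '000000002'),
  filter_call('QualifierFilter', :'=', 'substring:nick')
)
puts combined
# (PrefixFilter('000000002') AND QualifierFilter(=,'substring:nick'))
```

In the shell, the resulting string would then be passed as the FILTER option of scan, in place of the hand-written string.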
5. Row queries
hbase(main):005:0> get 'toplist_ware_ios_1009_201231','0000000009'
COLUMN CELL
info:loginid timestamp=1343625459713, value=zgh1968
info:nick timestamp=1343625459713, value=zwy99
info:score timestamp=1343625459713, value=5
info:userid timestamp=1343625459713, value=100366262
4 row(s) in 0.1000 seconds

hbase(main):006:0> get 'toplist_ware_ios_1009_201231','0000000009','info:nick'
COLUMN CELL
info:nick timestamp=1343625459713, value=zwy99
1 row(s) in 0.0100 seconds
hbase(main):009:0> scan 'toplist_ware_ios_1009_201231',FILTER=>"PrefixFilter('000000002')"
ROW COLUMN+CELL
0000000020 column=info:loginid, timestamp=1343625459713, value=jjm169212318
0000000020 column=info:nick, timestamp=1343625459713, value=Denny_feng
0000000020 column=info:score, timestamp=1343625459713, value=1
0000000020 column=info:userid, timestamp=1343625459713, value=169212318
0000000021 column=info:loginid, timestamp=1343625459713, value=jjm169371841
0000000021 column=info:nick, timestamp=1343625459713, value=\xE5\xB0\x8F\xE7\xBD\x97\xE6\x95\x99\xE7\xBB\x831
0000000021 column=info:score, timestamp=1343625459713, value=1
0000000021 column=info:userid, timestamp=1343625459713, value=169371841
2 row(s) in 0.0180 seconds

hbase(main):010:0> scan 'toplist_ware_ios_1009_201231',FILTER=>"PrefixFilter('000000002')",LIMIT=>1
ROW COLUMN+CELL
0000000020 column=info:loginid, timestamp=1343625459713, value=jjm169212318
0000000020 column=info:nick, timestamp=1343625459713, value=Denny_feng
0000000020 column=info:score, timestamp=1343625459713, value=1
0000000020 column=info:userid, timestamp=1343625459713, value=169212318
1 row(s) in 0.0170 seconds

hbase(main):011:0> scan 'toplist_ware_ios_1009_201231',{COLUMNS=>"info:nick",FILTER=>"PrefixFilter('000000002')",LIMIT=>1}
ROW COLUMN+CELL
0000000020 column=info:nick, timestamp=1343625459713, value=Denny_feng
1 row(s) in 0.0160 seconds
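Section 1.1 notes that a start/stop row range is generally preferable to a filter for selecting rows. A prefix scan can be expressed as such a range by using the prefix as the start row and the prefix with its last byte incremented as the exclusive stop row. The sketch below computes that stop row in plain Ruby; `prefix_stop_row` is a hypothetical helper name, not part of HBase.

```ruby
# Hypothetical helper: the smallest row key that sorts after every key
# starting with `prefix`, assuming byte-wise ordering of row keys.
# Returns nil when no such key exists (empty prefix or all bytes 0xFF),
# meaning the scan should simply run to the end of the table.
def prefix_stop_row(prefix)
  bytes = prefix.bytes
  bytes.pop while bytes.last == 0xFF   # trailing 0xFF bytes cannot be incremented
  return nil if bytes.empty?
  bytes[-1] += 1                       # bump the last remaining byte
  bytes.pack('C*')
end

start_row = '000000002'
stop_row  = prefix_stop_row(start_row)
puts "STARTROW=#{start_row} STOPROW=#{stop_row}"
# STARTROW=000000002 STOPROW=000000003
```

In the shell, the pair would then go into the scan options, for example `scan 'toplist_ware_ios_1009_201231', {STARTROW => start_row, STOPROW => stop_row}`, which should select the same rows as the PrefixFilter scans above.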