Thanks for sharing the platform- http://bjbsair.com/2020-04-10/tech-info/53339.html
This article walks through one of the HBase schema design case studies: log data and time series data.
Assume that the following data elements are being collected.
- Hostname
- Timestamp
- Log event
- Value / message
We can store them in an HBase table named LOG_DATA, but what should the rowkey be? From these attributes, the rowkey will be some combination of hostname, timestamp, and log event, but which combination exactly?
Timestamp in the Rowkey Lead Position
The rowkey [timestamp] [hostname] [log-event] suffers from the monotonically increasing rowkey problem described in Monotonically Increasing Row Keys / Timeseries Data: because new keys always sort after existing ones, all current writes land in the same region.
Another pattern frequently mentioned on the dist-lists is "bucketing" timestamps by performing a mod operation on the timestamp. If time-oriented scans are important, this can be a useful approach. Pay attention to the number of buckets, because returning results requires the same number of scans.
The bucket is computed as timestamp % numBuckets, and the rowkey is structured as [bucket] [timestamp] [hostname] [log-event].
As mentioned above, to select data for a specific time range, a scan must be performed for each bucket. For example, 100 buckets provide a wide distribution in the keyspace, but 100 scans are then required to obtain the data for a single timestamp, so there are trade-offs.
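The bucketing scheme described above can be sketched in plain Java, without the HBase client. The class name BucketedRowKey, its method names, and the one-byte bucket prefix (which assumes at most 256 buckets) are illustrative assumptions, not part of any HBase API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not an HBase class.
final class BucketedRowKey {

    // Bucket prefix: the timestamp modulo the number of buckets.
    static long bucket(long timestamp, int numBuckets) {
        return Math.floorMod(timestamp, (long) numBuckets);
    }

    // Builds [bucket][timestamp][hostname][log-event].
    // Assumption: the bucket fits in one byte (numBuckets <= 256).
    static byte[] rowKey(long timestamp, int numBuckets, String hostname, String logEvent) {
        byte[] host = hostname.getBytes(StandardCharsets.UTF_8);
        byte[] event = logEvent.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(1 + Long.BYTES + host.length + event.length)
                .put((byte) bucket(timestamp, numBuckets))
                .putLong(timestamp)
                .put(host)
                .put(event)
                .array();
    }

    // One {startKey, stopKey} pair per bucket for the time range [startTs, stopTs).
    // Each pair would drive a separate scan, which is the cost of bucketing.
    static byte[][][] scanRanges(long startTs, long stopTs, int numBuckets) {
        byte[][][] ranges = new byte[numBuckets][][];
        for (int b = 0; b < numBuckets; b++) {
            ranges[b] = new byte[][] { prefix(b, startTs), prefix(b, stopTs) };
        }
        return ranges;
    }

    private static byte[] prefix(int bucket, long timestamp) {
        return ByteBuffer.allocate(1 + Long.BYTES)
                .put((byte) bucket)
                .putLong(timestamp)
                .array();
    }
}
```

Calling BucketedRowKey.scanRanges(startTs, stopTs, 100) returns 100 key ranges, one for each scan that a time-range query would have to issue.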
Host in the Rowkey Lead Position
If there are enough hosts to spread the writes and reads across the keyspace, the rowkey [hostname] [log-event] [timestamp] is a candidate. This approach is useful if scanning by hostname is a priority.
Timestamp or Reverse Timestamp?
If the most important access path is to pull the most recent events, storing the timestamp as a reverse timestamp (for example, timestamp = Long.MAX_VALUE - timestamp) makes it possible to scan on [hostname] [log-event] and quickly obtain the most recently captured events.
Neither approach is wrong; it depends on which best suits the situation.
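A minimal sketch of the reverse-timestamp idea follows. It assumes non-negative timestamps encoded as 8-byte big-endian values; the class ReverseTimestamp and its methods are hypothetical names used only for illustration. The compareBytes method mimics the unsigned lexicographic order in which HBase sorts rowkeys.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only; not an HBase class.
final class ReverseTimestamp {

    // Newer timestamps map to smaller values.
    static long reverse(long timestamp) {
        return Long.MAX_VALUE - timestamp;
    }

    // 8-byte big-endian encoding of the reversed timestamp.
    static byte[] encode(long timestamp) {
        return ByteBuffer.allocate(Long.BYTES).putLong(reverse(timestamp)).array();
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

With this encoding, compareBytes(encode(newer), encode(older)) is negative, so the most recent event for a given [hostname] [log-event] prefix is the first row a forward scan returns.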
Reverse scan API
HBASE-4811 implements an API to scan a table, or a range within a table, in reverse, reducing the need to optimize the schema for forward or reverse scanning. This feature is available in HBase 0.98 and later.
Variable-Length or Fixed-Length Rowkeys?
It is critical to remember that the rowkey is stamped on every column in HBase. If the hostname is a and the event type is e1, the resulting rowkey will be quite small. But what if the ingested hostname is myserver1.mycompany.com and the event type is com.package1.subpackage2.subsubpackage3.ImportantService?
It might make sense to use some substitution in the rowkey. There are at least two approaches: hashed and numeric. For the Host in the Rowkey Lead Position example, it might look like this:
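To make the size difference concrete, the following sketch computes the length of a raw [hostname] [log-event] [timestamp] key. It assumes UTF-8 strings with no delimiters and an 8-byte timestamp; RowKeySize is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not an HBase class.
final class RowKeySize {

    // Length in bytes of [hostname][log-event][timestamp] using raw strings.
    static int rawKeyLength(String hostname, String eventType) {
        return hostname.getBytes(StandardCharsets.UTF_8).length
                + eventType.getBytes(StandardCharsets.UTF_8).length
                + Long.BYTES;
    }
}
```

For a and e1 this gives 11 bytes; for myserver1.mycompany.com and com.package1.subpackage2.subsubpackage3.ImportantService it gives 87 bytes, repeated for every stored cell.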
Composite rowkey with hashes:
- [MD5 hash of hostname] = 16 bytes
- [MD5 hash of event type] = 16 bytes
- [timestamp] = 8 bytes
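The hashed layout listed above can be sketched with the JDK's MessageDigest. The class HashedRowKey and its methods are illustrative names, and UTF-8 is an assumed string encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not an HBase class.
final class HashedRowKey {

    // MD5 always yields 16 bytes, whatever the input length.
    static byte[] md5(String value) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Builds [MD5(hostname)][MD5(eventType)][timestamp]: 16 + 16 + 8 = 40 bytes.
    static byte[] rowKey(String hostname, String eventType, long timestamp) {
        return ByteBuffer.allocate(16 + 16 + Long.BYTES)
                .put(md5(hostname))
                .put(md5(eventType))
                .putLong(timestamp)
                .array();
    }
}
```

Every key is 40 bytes regardless of how long the hostname or event type is, and all rows for one host still share the same 16-byte prefix.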
Composite rowkey with numeric substitution:
For this approach, another lookup table, called LOG_TYPES, is needed in addition to LOG_DATA. The rowkey of LOG_TYPES is:
- [type] (for example, a byte indicating hostname versus event type)
- [bytes], the variable-length bytes of the raw hostname or event type
A column for this rowkey could be a long holding the assigned number, which could be obtained by using an HBase counter.
So the resulting composite rowkey will be:
- [substituted long for hostname] = 8 bytes
- [substituted long for event type] = 8 bytes
- [timestamp] = 8 bytes
In either the hashed or the numeric substitution approach, the raw values of the hostname and event type can be stored as columns.
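The numeric substitution approach can be sketched as follows. This is a sketch under stated assumptions: an in-memory map stands in for the LOG_TYPES table and an AtomicLong stands in for an HBase counter. The class NumericRowKey, its type constants, and the map key format are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper for illustration only; not an HBase class.
final class NumericRowKey {
    static final byte TYPE_HOSTNAME = 0;
    static final byte TYPE_EVENT = 1;

    // In-memory stand-in for LOG_TYPES: [type][raw value] -> assigned long.
    private final Map<String, Long> logTypes = new HashMap<>();
    // Stand-in for an HBase counter that hands out the next number.
    private final AtomicLong counter = new AtomicLong();

    // Returns the existing number for a value, or assigns the next one.
    long idFor(byte type, String rawValue) {
        return logTypes.computeIfAbsent(type + ":" + rawValue,
                key -> counter.incrementAndGet());
    }

    // Builds [hostname id][event type id][timestamp]: 8 + 8 + 8 = 24 bytes.
    byte[] rowKey(String hostname, String eventType, long timestamp) {
        return ByteBuffer.allocate(3 * Long.BYTES)
                .putLong(idFor(TYPE_HOSTNAME, hostname))
                .putLong(idFor(TYPE_EVENT, eventType))
                .putLong(timestamp)
                .array();
    }
}
```

The key shrinks to 24 bytes, at the cost of a lookup (or a cached mapping) whenever a raw hostname or event type has to be translated to its number.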