HBase schema case study: log data and time series data

Thanks for sharing the platform- http://bjbsair.com/2020-04-10/tech-info/53339.html

This article walks through one of the HBase schema design case studies: log data and time series data.

Assume that the following data elements are being collected.

  • Hostname
  • Timestamp
  • Log event
  • Value / message

We can store them in an HBase table named LOG_DATA, but what should the rowkey be? Given these attributes, the rowkey will be some combination of hostname, timestamp, and log event, but which one exactly?

Timestamp in the lead position of the rowkey

A rowkey of the form [timestamp] [hostname] [log-event] suffers from the monotonically increasing rowkey problem described in "Monotonically Increasing Row Keys / Timeseries Data": all new writes land at the end of the key space, in a single region.

Another pattern frequently mentioned on the dist-lists is "bucketing" timestamps by performing a mod operation on the timestamp. If time-oriented scans are important, this could be a useful approach. Pay attention to the number of buckets, because returning results requires the same number of scans.

The bucket is computed as the timestamp mod the number of buckets, and the rowkey is constructed as [bucket] [timestamp] [hostname] [log-event].

As stated above, selecting data for a particular time range requires one scan per bucket. For example, 100 buckets give a wide distribution in the key space, but it takes 100 scans to obtain the data for a single timestamp, so there are trade-offs.
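
The bucketing scheme and its scan cost can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class BucketedRowKey and its methods are hypothetical names (not part of the HBase API), the bucket is stored in a single byte (so at most 256 buckets), and timestamps are non-negative longs encoded big-endian.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the HBase client API.
final class BucketedRowKey {

    // Bucket index obtained by a mod operation on the timestamp.
    static int bucketOf(long timestamp, int numBuckets) {
        return (int) Math.floorMod(timestamp, (long) numBuckets);
    }

    // Assumed layout: [bucket (1 byte)][timestamp (8 bytes)][hostname][log-event].
    static byte[] rowKey(long timestamp, String hostname, String logEvent, int numBuckets) {
        byte[] host = hostname.getBytes(StandardCharsets.UTF_8);
        byte[] event = logEvent.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(1 + Long.BYTES + host.length + event.length)
                .put((byte) bucketOf(timestamp, numBuckets))
                .putLong(timestamp)
                .put(host)
                .put(event)
                .array();
    }

    // One {startKey, stopKey} pair per bucket for the time range [startTs, stopTs).
    // The length of the result equals the number of scans that must be issued.
    static byte[][][] scanRanges(long startTs, long stopTs, int numBuckets) {
        byte[][][] ranges = new byte[numBuckets][][];
        for (int b = 0; b < numBuckets; b++) {
            ranges[b] = new byte[][] { prefix(b, startTs), prefix(b, stopTs) };
        }
        return ranges;
    }

    private static byte[] prefix(int bucket, long timestamp) {
        return ByteBuffer.allocate(1 + Long.BYTES)
                .put((byte) bucket)
                .putLong(timestamp)
                .array();
    }
}
```

With 100 buckets, scanRanges returns 100 key ranges, which is exactly the trade-off described above: wider write distribution in exchange for more scans per time-range query.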

Host in the lead position of the rowkey

If there is a fairly large number of hosts to spread the writes and reads across the key space, the rowkey [hostname] [log-event] [timestamp] is a candidate. This approach is useful if scanning by hostname is a priority.

Timestamp or reverse timestamp

If the most important access path is to pull the most recent events, storing the timestamp as a reverse timestamp (for example, timestamp = Long.MAX_VALUE - timestamp) makes it possible to scan on [hostname] [log-event] and obtain the most recently captured events first.
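
A minimal sketch of why the reverse timestamp works, assuming non-negative timestamps encoded as big-endian longs. The class ReverseTimestampKey is a hypothetical name used only for illustration. HBase orders rowkeys by unsigned lexicographic byte comparison, which Arrays.compareUnsigned (Java 9 and later) reproduces here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the HBase client API.
final class ReverseTimestampKey {

    // Larger (newer) timestamps map to smaller values.
    static long reverse(long timestamp) {
        return Long.MAX_VALUE - timestamp;
    }

    // Assumed layout: [hostname][log-event][reverse timestamp (8 bytes)].
    static byte[] rowKey(String hostname, String logEvent, long timestamp) {
        byte[] host = hostname.getBytes(StandardCharsets.UTF_8);
        byte[] event = logEvent.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(host.length + event.length + Long.BYTES)
                .put(host)
                .put(event)
                .putLong(reverse(timestamp))
                .array();
    }

    // Unsigned lexicographic comparison, the order in which rowkeys are sorted.
    static int compare(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }
}
```

For the same hostname and log event, the key of a newer event compares lower than the key of an older one, so a forward scan over the [hostname] [log-event] prefix meets the newest event first.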

Neither approach is wrong; it depends on what is most appropriate for the situation.

Reverse scan API

HBASE-4811 implements an API to scan a table, or a range within a table, in reverse, which reduces the need to optimize the schema for forward or reverse scanning. This feature is available in HBase 0.98 and later.
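
In the HBase client, a reverse scan is requested with Scan.setReversed(true). The sketch below does not use the HBase client at all. It only illustrates the idea with a java.util.TreeMap standing in for a sorted table keyed by plain (non-reversed) timestamps. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// In-memory stand-in for a sorted table; for illustration only.
final class ReverseScanSketch {

    // Returns up to `limit` events, newest first, by walking the sorted keys backwards.
    static List<String> latest(TreeMap<Long, String> eventsByTimestamp, int limit) {
        List<String> out = new ArrayList<>();
        for (String event : eventsByTimestamp.descendingMap().values()) {
            if (out.size() == limit) {
                break;
            }
            out.add(event);
        }
        return out;
    }
}
```

Because the keys are iterated in descending order, the most recent events come back first without storing reverse timestamps in the key.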

Variable-length or fixed-length row keys

Remember that the rowkey is stamped on every column in HBase, so its size matters. If the hostname is a and the event type is e1, the resulting rowkey is quite small. But what if the ingested hostname is myserver1.mycompany.com and the event type is com.package1.subpackage2.subsubpackage3.ImportantService?

It might make sense to use some substitution in the rowkey. There are at least two approaches: hashed and numeric. For the example with the host name in the lead position of the rowkey, it might look like this:

Compound Rowkey with hash:

  • [MD5 hash of hostname] = 16 bytes
  • [MD5 hash of event-type] = 16 bytes
  • [timestamp] = 8 bytes
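
The hashed layout above can be sketched with the standard java.security.MessageDigest class. HashedRowKey is a hypothetical helper name, and UTF-8 encoding of the strings is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of the HBase client API.
final class HashedRowKey {

    // An MD5 digest is always 16 bytes, regardless of the input length.
    static byte[] md5(String value) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Layout: [MD5 of hostname (16)][MD5 of event type (16)][timestamp (8)] = 40 bytes.
    static byte[] rowKey(String hostname, String eventType, long timestamp) {
        return ByteBuffer.allocate(16 + 16 + Long.BYTES)
                .put(md5(hostname))
                .put(md5(eventType))
                .putLong(timestamp)
                .array();
    }
}
```

The key is 40 bytes whether the hostname is a or myserver1.mycompany.com, which is the point of the substitution.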

Compound Rowkey with numeric substitution:

This approach requires another lookup table, called LOG_TYPES, in addition to LOG_DATA. The rowkey of LOG_TYPES is:

  • [type] (for example, a byte indicating hostname vs. event type)
  • [bytes], the variable-length bytes of the raw hostname or event type

A column for this rowkey could be a long holding the assigned number, which could be obtained by using an HBase counter.

So the resulting composite rowkey will be:

  • [substituted long for hostname] = 8 bytes
  • [substituted long for event type] = 8 bytes
  • [timestamp] = 8 bytes

In either the hashed or the numeric substitution approach, the raw values of the hostname and event type can still be stored as columns.
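
The numeric substitution approach can be sketched as follows. This is an in-memory illustration only: a HashMap stands in for the LOG_TYPES table and a plain long field stands in for the HBase counter (in a real client the number could come from a call such as Table.incrementColumnValue). The class NumericSubstitution and its members are hypothetical names, and the sketch is not thread-safe.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for the LOG_TYPES lookup; for illustration only.
final class NumericSubstitution {
    static final byte TYPE_HOSTNAME = 0;
    static final byte TYPE_EVENT = 1;

    // Stand-in for LOG_TYPES: "[type]:[raw value]" -> assigned long.
    private final Map<String, Long> logTypes = new HashMap<>();
    // Stand-in for an HBase counter.
    private long counter = 0;

    // Returns the number assigned to the raw value, assigning the next one on first use.
    long idFor(byte type, String rawValue) {
        return logTypes.computeIfAbsent(type + ":" + rawValue, key -> ++counter);
    }

    // Layout: [long for hostname (8)][long for event type (8)][timestamp (8)] = 24 bytes.
    byte[] rowKey(String hostname, String eventType, long timestamp) {
        return ByteBuffer.allocate(3 * Long.BYTES)
                .putLong(idFor(TYPE_HOSTNAME, hostname))
                .putLong(idFor(TYPE_EVENT, eventType))
                .putLong(timestamp)
                .array();
    }
}
```

Compared with the hashed key, this one is shorter (24 bytes instead of 40) but needs a lookup, and possibly a counter increment, for every previously unseen hostname or event type.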

Origin blog.51cto.com/14744108/2486399