HBase Schema Design Case Study: List Data

Thanks for sharing the platform- http://bjbsair.com/2020-04-10/tech-info/53341.html

The following is an exchange from the user dist-list about a fairly common question: how to handle per-user list data in Apache HBase.

Question:

We are looking at how to store large amounts of per-user list data in HBase, and we are trying to figure out which access pattern makes the most sense. One option is to store most of the data in the row key, so we would have something like the following:

[Example row layout for the first option; image not preserved]

The other option is to do this entirely with rows of the following form:

[Example row layout for the second option; image not preserved]

Here each row would contain multiple values. In the first case, reading the first thirty values would be:

[Example read for the first option; image not preserved]

In the second case, it would be like this:

[Example read for the second option; image not preserved]

The general usage pattern is to read only the first 30 values of these lists; reads that go deeper into a list are infrequent. Some users will have 30 or fewer values in total in these lists, and some will have millions (i.e., a power-law distribution).

The single-value format seems like it would take up more space in HBase, but it would offer more flexibility for retrieval and pagination. Is there any significant performance advantage to paginating with gets rather than with scans?

My initial understanding was that a scan should be faster if our page size is unknown (and scanner caching is set appropriately), but that gets should be faster if we always need the same page size. Different people have told me opposite things about performance. I assume the page size will be relatively consistent, so for most use cases we can guarantee that we only need one page of data in the fixed-page-length case. I also assume that updates will be infrequent, but that we may have inserts into the middle of these lists (which means we would need to update all subsequent rows).

Answer:

If I understand you correctly, you are ultimately trying to store triples of the form "user, valueid, value", right? For example, something like:

[Example triples; image not preserved]

(But the usernames are fixed width, and the valueids are fixed width.)
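Fixed widths matter because HBase orders rows by comparing row-key bytes, not by interpreting them as numbers. The following is a minimal sketch of one way to build such keys; the widths (8 bytes each), the zero-byte padding, and the helper name `row_key` are assumptions for illustration, not something the original exchange specifies.

```python
USER_WIDTH = 8  # hypothetical fixed width for user names, in bytes
ID_WIDTH = 8    # hypothetical fixed width for value ids, in bytes


def row_key(user: str, value_id: int) -> bytes:
    """Build a fixed-width row key: padded user name + big-endian value id."""
    raw = user.encode("utf-8")
    if len(raw) > USER_WIDTH:
        raise ValueError("user name exceeds the fixed width")
    # Pad the user name on the right so every key has the same prefix length.
    return raw.ljust(USER_WIDTH, b"\x00") + value_id.to_bytes(ID_WIDTH, "big")


# Variable-width decimal strings do not sort numerically ...
unpadded = sorted(["1", "2", "10"])

# ... while fixed-width big-endian ids sort in numeric order.
padded = sorted(row_key("alice", i) for i in (10, 2, 1))
```

With variable-width ids, "10" sorts before "2"; with the fixed-width encoding, byte order and numeric order agree, which is what a "next 30 values" scan relies on.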

And your access pattern is along the lines of: "for user X, list the next 30 values, starting with valueid Y". Is that right? And should these values be returned sorted by valueid?

The tl;dr version is that you should probably go with one row per user + value, and not build a complicated intra-row pagination scheme on your own unless you are really sure it is needed.

Your two options mirror a common question people have when designing HBase schemas: should I go "tall" or "wide"? Your first schema is "tall": each row represents one value for one user, so there are many rows in the table for each user; the row key is user + valueid, and there would (presumably) be a single column qualifier that means "the value". This is great if you want to scan over rows in sorted order by row key (hence my question above about whether the ids sort correctly). You can start a scan at any user + valueid, read the next 30, and be done. What you give up is the ability to have transactional guarantees around all the rows for one user, but it doesn't sound like you need that.
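The "tall" read pattern can be sketched with an in-memory stand-in for a sorted table. This is not the HBase client API: `TallTable`, `scan`, `user_prefix`, `row_key`, and the 8-byte widths are all hypothetical names and values chosen for illustration. The point is only that, with keys kept in sorted order, "next 30 values starting at valueid Y" is a seek followed by a short sequential read.

```python
import bisect

USER_WIDTH = 8  # hypothetical fixed widths, in bytes
ID_WIDTH = 8


def user_prefix(user: str) -> bytes:
    raw = user.encode("utf-8")
    if len(raw) > USER_WIDTH:
        raise ValueError("user name exceeds the fixed width")
    return raw.ljust(USER_WIDTH, b"\x00")


def row_key(user: str, value_id: int) -> bytes:
    return user_prefix(user) + value_id.to_bytes(ID_WIDTH, "big")


class TallTable:
    """In-memory stand-in for a table whose rows are sorted by row key."""

    def __init__(self):
        self._keys = []    # row keys, kept sorted
        self._values = {}  # row key -> value

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start_row: bytes, prefix: bytes, limit: int):
        """Return up to `limit` rows from `start_row` that share `prefix`."""
        i = bisect.bisect_left(self._keys, start_row)  # seek to the start row
        rows = []
        while i < len(self._keys) and len(rows) < limit:
            key = self._keys[i]
            if not key.startswith(prefix):  # ran past this user's rows
                break
            rows.append((key, self._values[key]))
            i += 1
        return rows


table = TallTable()
for i in range(100):
    table.put(row_key("alice", i), b"value-%d" % i)
for i in range(5):
    table.put(row_key("bob", i), b"value-%d" % i)

# "For user alice, list the next 30 values, starting with valueid 40."
page = table.scan(row_key("alice", 40), user_prefix("alice"), 30)
```

A page that starts near the end of a user's list simply comes back short; the prefix check keeps the read from spilling into the next user's rows.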

The second option is "wide": you store a bunch of values in one row, using different qualifiers (where the qualifier is the valueid). The simple way to do that would be to store all of a user's values in a single row. I'm guessing you jumped to the "paginated" version because you assume that storing millions of columns in a single row would be bad for performance, which may or may not be true; as long as you are not trying to do too much in a single request, or doing things like scanning over and returning all of the cells in the row, it shouldn't be fundamentally worse. The client has methods that allow you to get specific slices of columns.
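The "wide" layout and a column slice can be sketched the same way. Again this is an in-memory stand-in, not the HBase client: `WideRow`, `get_slice`, and `qualifier_for` are hypothetical names. In the real client, filters such as `ColumnRangeFilter` or `ColumnPaginationFilter` play a comparable role, though which one fits depends on the access pattern.

```python
import bisect

ID_WIDTH = 8  # hypothetical fixed width for value ids, in bytes


def qualifier_for(value_id: int) -> bytes:
    """Use the fixed-width value id as the column qualifier."""
    return value_id.to_bytes(ID_WIDTH, "big")


class WideRow:
    """In-memory stand-in for one row whose columns are sorted by qualifier."""

    def __init__(self):
        self._qualifiers = []  # qualifiers, kept sorted
        self._cells = {}       # qualifier -> value

    def put(self, qualifier: bytes, value: bytes) -> None:
        if qualifier not in self._cells:
            bisect.insort(self._qualifiers, qualifier)
        self._cells[qualifier] = value

    def get_slice(self, start_qualifier: bytes, limit: int):
        """Return up to `limit` cells starting at `start_qualifier`."""
        i = bisect.bisect_left(self._qualifiers, start_qualifier)
        chosen = self._qualifiers[i:i + limit]
        return [(q, self._cells[q]) for q in chosen]


# One row per user; every value of that user is a column in the row.
alice = WideRow()
for i in range(1000):
    alice.put(qualifier_for(i), b"value-%d" % i)

# Read 30 columns starting at valueid 500 without touching the rest of the row.
cells = alice.get_slice(qualifier_for(500), 30)
```

The read has the same shape as the tall scan; only the place where the valueid lives has changed, from the row key to the qualifier.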

Note that neither case fundamentally uses more disk space than the other; you are just "shifting" part of the identifying information for a value either to the left (into the row key, in option one) or to the right (into the column qualifier, in option two). Under the covers, every key/value still stores the whole row key and the full column name (family and qualifier).
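A back-of-the-envelope check of that claim, under a deliberately simplified model: only the row key, column family, and qualifier bytes of each cell are counted, and the fixed per-cell overhead (lengths, timestamp, type) is ignored because it is the same in both layouts. The widths and the one-byte family name `d` are assumptions.

```python
USER_WIDTH = 8  # hypothetical fixed widths, in bytes
ID_WIDTH = 8
FAMILY = b"d"   # hypothetical one-byte column family name


def cell_key_bytes(row: bytes, family: bytes, qualifier: bytes) -> int:
    """Identifying bytes stored with one cell (simplified model)."""
    return len(row) + len(family) + len(qualifier)


user = b"alice".ljust(USER_WIDTH, b"\x00")
value_id = (42).to_bytes(ID_WIDTH, "big")

# Tall: valueid lives in the row key, the qualifier is empty.
tall = cell_key_bytes(user + value_id, FAMILY, b"")

# Wide: valueid lives in the qualifier, the row key is just the user.
wide = cell_key_bytes(user, FAMILY, value_id)

n_values = 1_000_000
tall_total = n_values * tall
wide_total = n_values * wide
```

Both layouts come to 17 identifying bytes per cell in this model, so a million values cost the same either way.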

As you note, the manually paginated version has a lot more complexity, such as having to keep track of how many items are in each page and re-shuffling when new values are inserted. That seems significantly more complex. It might have some slight speed advantages (or disadvantages!) at extremely high throughput, and the only way to really know is to try it out. If you don't have time to build it both ways and compare, my advice would be to start with the simplest option (one row per user + valueid). Start simple and iterate!
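The re-shuffling cost can be made concrete with a small simulation. It assumes a hypothetical paging scheme in which every page row holds a fixed number of sorted valueids and all pages except the last are full; `insert_paged` is an illustrative helper, not part of any HBase API.

```python
import bisect


def insert_paged(pages, page_size, value_id):
    """Insert into sorted fixed-size pages; return how many pages were rewritten."""
    # Find the first page whose last element is >= value_id (or the last page).
    idx = 0
    while idx < len(pages) - 1 and pages[idx][-1] < value_id:
        idx += 1
    if not pages:
        pages.append([])
    bisect.insort(pages[idx], value_id)
    rewritten = 1
    # An overfull page pushes its last element onto the front of the next page.
    while len(pages[idx]) > page_size:
        carry = pages[idx].pop()
        idx += 1
        if idx == len(pages):
            pages.append([])
        pages[idx].insert(0, carry)
        rewritten += 1
    return rewritten


pages = [[0, 2, 4], [6, 8, 10], [12, 14, 16]]

# Inserting near the front cascades through every later page.
front_cost = insert_paged(pages, page_size=3, value_id=1)

# Appending at the end touches only the last page.
end_cost = insert_paged(pages, page_size=3, value_id=17)
```

In this run the insert near the front rewrites four page rows, while the same insert in the one-row-per-value layout is a single put with no effect on neighboring rows.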

Origin blog.51cto.com/14744108/2486400