A note about HBase multi-version storage


HBase is a multi-version storage system. Before version 0.96, each column kept 3 versions by default; from 0.96 onward, the default is 1. A "version" is simply the same cell (same row key, column family, and column qualifier) written with a different timestamp. To support this, HBase sorts the versions of a cell by timestamp in descending order in its underlying storage, so a read returns the latest version unless we explicitly ask for data in a specific time range.
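The read behavior described above can be sketched with a small in-memory model. This is only an illustration under stated assumptions, not HBase code: `VersionedCell`, its methods, and the trimming done at write time are hypothetical names and simplifications introduced here, and the time range is modeled as half-open (`minTs` inclusive, `maxTs` exclusive).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of one cell (row + family + qualifier); NOT the HBase client API. */
final class VersionedCell {
    // Versions keyed by timestamp, newest first, mirroring the descending timestamp order.
    private final TreeMap<Long, String> versions = new TreeMap<>(Comparator.reverseOrder());
    private final int maxVersions;

    VersionedCell(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    /** Stores a value at the given timestamp, then drops the oldest versions beyond maxVersions. */
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // last entry in reverse order = oldest timestamp
        }
    }

    /** Default read: the newest version, or null if the cell is empty. */
    String get() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    /** Time-range read: values with minTs <= timestamp < maxTs, newest first. */
    List<String> get(long minTs, long maxTs) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<Long, String> e : versions.entrySet()) {
            long ts = e.getKey();
            if (ts >= minTs && ts < maxTs) {
                result.add(e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        VersionedCell cell = new VersionedCell(3);
        cell.put(100L, "a");
        cell.put(300L, "c");
        cell.put(200L, "b");
        System.out.println(cell.get());           // prints "c"
        System.out.println(cell.get(100L, 300L)); // prints "[b, a]"
    }
}
```

Keeping the map in reverse order means the default read is just "take the first entry", which is the point of sorting by descending timestamp.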



Let's take a look at the constructors of the Put and Delete classes in HBase:

Put:
````
Put(byte[] row)
Put(byte[] row, long ts)
Put(byte[] rowArray, int rowOffset, int rowLength)
Put(byte[] rowArray, int rowOffset, int rowLength, long ts)
Put(ByteBuffer row)
Put(ByteBuffer row, long ts)
Put(Put putToCopy)
````


Delete:
````
Delete(byte[] row)
Delete(byte[] row, long timestamp)
Delete(byte[] rowArray, int rowOffset, int rowLength)
Delete(byte[] rowArray, int rowOffset, int rowLength, long ts)
Delete(Delete d)
````



As shown above, the first constructor is the one most commonly used for both Put and Delete. By default, when we insert data with Put, its timestamp is the current time. We can also set the timestamp ourselves, but I suggest not doing so casually: a wrong value can cause some baffling problems.

As mentioned, HBase sorts versions in descending timestamp order when reading, so a read always returns the latest one. Suppose you insert a cell with a timestamp far in the future (for example, a value close to Long.MAX_VALUE) and afterwards insert, update, or delete without passing a timestamp. You will be surprised to find that none of these operations seems to take effect. Why? Because each of them is stamped with the current time, which is smaller than the timestamp already stored. With only one version kept, HBase treats the new write as an older version, and an older version cannot override a newer one. The same is true for deletion: a delete only masks cells whose timestamp is less than or equal to its own, so no matter how many times you run the delete command, that cell cannot be deleted.
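To make the future-timestamp trap concrete, here is a minimal sketch of a single cell with a delete marker. It is a toy model, not HBase code: `ShadowedCell` and its methods are hypothetical, the timestamps `1_000L` and `9_000L` are made-up stand-ins for "now" and "far in the future", and the rule that a delete hides versions at or below its own timestamp is the assumption being modeled.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

/** Toy single-cell model of timestamp shadowing; an illustration, NOT HBase code. */
final class ShadowedCell {
    // All written versions, newest timestamp first.
    private final TreeMap<Long, String> versions = new TreeMap<>(Comparator.reverseOrder());
    // Newest delete marker seen so far; it hides every version at or below this timestamp.
    private long deleteMarkerTs = Long.MIN_VALUE;

    void put(long ts, String value) {
        versions.put(ts, value);
    }

    void delete(long ts) {
        deleteMarkerTs = Math.max(deleteMarkerTs, ts);
    }

    /** Single-version read: the newest version, unless the delete marker hides it. */
    String get() {
        Map.Entry<Long, String> newest = versions.firstEntry();
        return (newest != null && newest.getKey() > deleteMarkerTs) ? newest.getValue() : null;
    }

    public static void main(String[] args) {
        long now = 1_000L;    // stand-in for the current time
        long future = 9_000L; // stand-in for a mistaken far-future timestamp

        ShadowedCell cell = new ShadowedCell();
        cell.put(future, "v1");         // first insert with a custom future timestamp
        cell.put(now, "v2");            // later "update" stamped with the current time
        System.out.println(cell.get()); // prints "v1": the update is shadowed

        cell.delete(now + 1);           // delete stamped with the current time
        System.out.println(cell.get()); // prints "v1": the delete does not reach the future cell

        cell.delete(future);            // delete with a timestamp >= the stored one
        System.out.println(cell.get()); // prints "null"
    }
}
```

In this model, the only way to get rid of the future-stamped value is to issue the delete with a timestamp at least as large as the stored one.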



Note that the second constructors of Put and Delete in the API above take a timestamp. Don't misunderstand: this timestamp is not part of the rowkey; it applies to the columns that follow. That is, if you insert a row that has multiple column families, each with multiple columns, and all of them share the same timestamp, you can pass it once as the second argument of Put instead of specifying it on every column. Of course, if a timestamp is also specified on a column, the column-level timestamp takes precedence.
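The precedence rule can be sketched as follows. `MiniPut` is a hypothetical stand-in written for this note, not the real `Put` class: it uses strings instead of byte arrays and tracks only which timestamp each column ends up with, assuming the behavior described above (columns inherit the Put-level timestamp unless they carry their own).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a Put, showing only how timestamps are resolved. */
final class MiniPut {
    private final long putTs;
    // "family:qualifier" -> timestamp that the column will carry
    private final Map<String, Long> cellTimestamps = new LinkedHashMap<>();

    /** Plays the role of the Put(row, ts) constructor; the row itself is omitted here. */
    MiniPut(long putTs) {
        this.putTs = putTs;
    }

    /** No column-level timestamp: the column inherits the Put-level one. */
    MiniPut addColumn(String family, String qualifier) {
        cellTimestamps.put(family + ":" + qualifier, putTs);
        return this;
    }

    /** Column-level timestamp: takes precedence over the Put-level one. */
    MiniPut addColumn(String family, String qualifier, long ts) {
        cellTimestamps.put(family + ":" + qualifier, ts);
        return this;
    }

    /** Timestamp recorded for a column; throws NullPointerException if it was never added. */
    long timestampOf(String family, String qualifier) {
        return cellTimestamps.get(family + ":" + qualifier);
    }

    public static void main(String[] args) {
        MiniPut put = new MiniPut(500L)
                .addColumn("cf1", "a")        // inherits 500
                .addColumn("cf2", "b")        // inherits 500
                .addColumn("cf1", "c", 700L); // explicit 700 wins
        System.out.println(put.timestampOf("cf1", "a")); // prints "500"
        System.out.println(put.timestampOf("cf1", "c")); // prints "700"
    }
}
```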







Summary:

Multi-version storage is a powerful feature of HBase. When using it, be careful about overriding the default behavior of taking the current timestamp. If you do override it, make sure that on every later insert, delete, or update the timestamp is greater than the one used for the first insertion; otherwise the modification will not take effect. So if one day you delete a row of HBase data and find that it is still there, don't be surprised. If the code itself is correct, the most likely cause is that the current timestamp is smaller than the timestamp of the data already stored. This needs special attention. To repeat once more: try not to set a custom timestamp when inserting data into HBase unless the business scenario requires it.

If you have any questions, follow the WeChat public account "I am the siege division" (woshigcs) and leave a message there. Don't run up technical debt, and don't run up health debt either. On the road of seeking the Tao, I walk with you.
