How does Redis store hundreds of millions of user states?


Some time ago, I saw an interview question online:

How would you use Redis to store and count the login status of 100 million users over a year, and quickly retrieve the number of active users in any time window?

I found it interesting, so I thought it through carefully and ran a series of experiments to simulate it myself. It turned out to be quite rewarding, so I've written it up to share with everyone.

Redis is an in-memory database that uses a single-threaded, event-driven mechanism to handle network requests. In production it can sustain roughly 30,000-40,000 QPS/TPS, with excellent read and write performance. It is also well suited to storing user-state information that has little impact on core business logic.

For this question, there are two important points to consider:

1. How to choose an appropriate data type to store the data of 100 million users. Plain strings are clearly not good enough: measuring the memory usage of the simplest key-value pair (key `aaa`, value `1`) shows it takes 48 bytes.

[screenshot: measured memory usage of the minimal key-value pair, 48 bytes]

Assuming each user logs in every day and occupies one KV pair, 100 million users need (48 × 100,000,000)/1024/1024/1024 ≈ 4.47 GB. And that is just one day's worth.
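The arithmetic can be checked in a couple of lines (the 48-byte figure is the per-pair overhead measured above):

```python
# Estimated cost of storing one day of logins as plain string KV pairs,
# assuming ~48 bytes per key-value pair as measured above.
BYTES_PER_KV = 48
USERS = 100_000_000

daily_kv_cost_gb = BYTES_PER_KV * USERS / 1024 / 1024 / 1024
print(f"{daily_kv_cost_gb:.2f} GB per day")  # ≈ 4.47 GB
```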

2. How to support the queries. Redis is an in-memory key-value store: it can only look up a value by its key, and cannot do fast full-text retrieval over documents the way Elasticsearch does with inverted indexes.

Redis does, however, have a structure that can store a great deal of information in very little space.

Since version 2.2.0, Redis has offered bitmaps. A bitmap is not actually a separate data structure: it is really just the string type, treated as binary data in which each bit can be 0 or 1.

Redis provides a dedicated set of commands for bitmaps, so any individual bit can be set or read.

The core commands of bitmap:

SETBIT

Syntax: SETBIT key offset value
For example:

setbit abc 5 1 ----> 000001

setbit abc 2 1 ----> 001001

GETBIT

Syntax: GETBIT key offset
For example:

getbit abc 5 ----> 1

getbit abc 1 ----> 0
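To make the offset semantics concrete, here is a small pure-Python sketch of how SETBIT/GETBIT address individual bits inside a string value (in Redis, bit 0 is the most significant bit of the first byte, and the string grows as needed):

```python
def setbit(buf: bytearray, offset: int, value: int) -> None:
    """Set bit `offset` (0 = MSB of first byte), growing the buffer as Redis does."""
    byte_idx, bit_idx = divmod(offset, 8)
    if byte_idx >= len(buf):
        buf.extend(b"\x00" * (byte_idx - len(buf) + 1))
    mask = 1 << (7 - bit_idx)
    if value:
        buf[byte_idx] |= mask
    else:
        buf[byte_idx] &= ~mask

def getbit(buf: bytearray, offset: int) -> int:
    byte_idx, bit_idx = divmod(offset, 8)
    if byte_idx >= len(buf):
        return 0  # reads past the end return 0, as in Redis
    return (buf[byte_idx] >> (7 - bit_idx)) & 1

abc = bytearray()
setbit(abc, 5, 1)
setbit(abc, 2, 1)
print(getbit(abc, 5), getbit(abc, 1))  # 1 0
```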

Other bitmap commands include BITCOUNT, BITPOS, and BITOP. They are all bit-level operations.

Because each bit of a bitmap occupies only 1 bit of space, we can exploit this by using each day as a key, with the value holding the activity status of all 100 million users. Assume a user counts as active if they log in at least once that day: active is recorded as 1, inactive as 0, and the user id is used as the offset. This way, a single key can store the daily activity status of 100 million users.

[diagram: one bitmap key per day, user id as the bit offset]

Let's calculate how much space the value of such a bitmap occupies. Each user takes 1 bit, so 100 million users take 100 million bits, and 8 bits = 1 byte:

100,000,000 / 8 / 1024 / 1024 ≈ 11.92 MB
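The same arithmetic, extended to a full year, works out as follows:

```python
# One bit per user per day; compare this with the ~4.47 GB/day of plain KV pairs.
USERS = 100_000_000

bitmap_mb = USERS / 8 / 1024 / 1024
print(f"{bitmap_mb:.2f} MB per day")   # ≈ 11.92 MB

year_gb = bitmap_mb * 365 / 1024
print(f"{year_gb:.2f} GB per year")    # ≈ 4.25 GB
```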

Using a test project, I stuffed 100 million bits into one key via a Lua script, then measured the memory with rdb tools. The actual result:

[screenshot: rdb tools memory measurement of the 100-million-bit key]

So 100 million users for one day consume about 12 MB of memory, which fully meets the requirement: a whole year is only about 4 GB. After a few years, Redis can be deployed as a cluster to expand storage. We could also compress the bitmaps with a bitmap compression scheme such as WAH, EWAH, or Roaring Bitmaps; that is a topic for another time.

We store the daily login status of 100 million users in Redis as bitmaps. To check whether the user with id 88000 was active on a given day, use GETBIT directly:

getbit 2019-01-01 88000 [time complexity O(1)]

To count all active users on a given day, use BITCOUNT, which counts the bits set to 1, i.e. the number of active users:

bitcount 2019-01-01 [time complexity O(N)]
To count the number of active users over a period of time, use BITOP, which provides four bitwise operations: AND, OR, XOR, and NOT. We can OR together the keys for all days in the period; a bit that is still 0 in the result means that user never logged in during the period, so counting the 1 bits gives the answer. The following example finds the number of users active at least once from 2019-01-01 to 2019-01-05:

bitop or result 2019-01-01 2019-01-02 2019-01-03 2019-01-04 2019-01-05 [Time complexity is O(N)]
bitcount result
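The BITOP OR + BITCOUNT pipeline can be simulated in pure Python with integers standing in for the daily bitmaps (bit i represents user i; the toy 5-user values below are made up for illustration):

```python
# Simulate five daily bitmaps as Python ints (bit i = user i active that day).
days = {
    "2019-01-01": 0b10110,
    "2019-01-02": 0b00011,
    "2019-01-03": 0b01000,
    "2019-01-04": 0b00000,
    "2019-01-05": 0b10001,
}

result = 0
for bitmap in days.values():     # BITOP OR result 2019-01-01 ... 2019-01-05
    result |= bitmap

active = bin(result).count("1")  # BITCOUNT result
print(active)  # 5 users were active at least once in the window
```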

In terms of time complexity, whether counting a single day or a whole period, actual tests complete in about a second, which meets our expectations.

Bitmaps are a good fit for scenarios that need to record a large amount of simple boolean information in very little space. Broadly, the usage patterns fall into two categories:

1. Horizontal expansion across business objects: the key contains the id of a business object, for example recording a feature switch per device, where 1 means on and 0 means off. This can scale almost without limit, since each value can hold 2^32 bits. However, because each key embeds an object id, the keys can take more storage than the values, so there is some room for optimization from a space-usage perspective.

2. Vertical expansion within one business: the key names the business (here, a date), and each object's id is recorded as the bit offset. This interview question is solved this way, cleverly using the user id as the offset into the value. When the number of objects exceeds 2^32 (about 4.3 billion, the bit limit of a 512 MB Redis string), the data can be sharded across multiple keys.
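Sharding past the 2^32-bit limit only needs a divmod; a minimal sketch, where the `active:{day}:{shard}` key naming is a hypothetical scheme of my own, not something Redis prescribes:

```python
BITS_PER_KEY = 1 << 32  # a Redis string tops out at 512 MB = 2^32 bits

def shard_for(day: str, user_id: int) -> tuple:
    """Map a user id to (key, offset); the key format here is a made-up convention."""
    shard, offset = divmod(user_id, BITS_PER_KEY)
    return f"active:{day}:{shard}", offset

print(shard_for("2019-01-01", 88_000))         # lands in shard 0
print(shard_for("2019-01-01", 5_000_000_000))  # beyond 2^32, lands in shard 1
```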

It seems bitmaps solve both the storage and the statistics problems perfectly. Is there anything even more space-efficient?

The answer is yes.

Redis added the HyperLogLog data structure in version 2.8.9. According to the Redis documentation, it is a probabilistic data structure used to estimate the cardinality of a set: it trades a small loss of accuracy for a huge reduction in memory.

Let's look at the HyperLogLog commands first:

PFADD adds an element; duplicates are counted only once
PFCOUNT returns an approximate count of distinct elements
PFMERGE merges multiple HyperLogLogs into one
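To illustrate the semantics of deduplicated counting and merging, here is an exact-set analog in Python; keep in mind the real HyperLogLog returns approximate counts, not exact ones:

```python
# Exact-set analogs of the three commands; a real HLL only approximates len().
hll_a, hll_b = set(), set()

hll_a.add("user:1")   # PFADD a user:1
hll_a.add("user:1")   # duplicate element still counts only once
hll_a.add("user:2")
hll_b.add("user:2")
hll_b.add("user:3")

print(len(hll_a))      # PFCOUNT a    -> 2
merged = hll_a | hll_b # PFMERGE dest a b
print(len(merged))     # PFCOUNT dest -> 3 (user:2 deduplicated)
```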

Easy enough to understand, isn't it? Now, how much space does a HyperLogLog need to record the activity of 100 million users? Is it more compact than a bitmap?

I PFADDed 100 million elements into a HyperLogLog through a test project, then inspected the key with rdb tools:

[screenshot: rdb tools measurement of the HyperLogLog key, 14392 bytes]

Only 14,392 bytes! That is about 14 KB. Yes, you read that right: where the bitmap needs 12 MB to track 100 million users, the HyperLogLog needs only about 14 KB.

This is an astonishing result; it is hard to believe such a small space can summarize such a large data set.

Next, I added another 10 million elements, and the key was still about 14 KB. In other words, no matter how many elements you add, the size stays around 14 KB.

Checking the documentation, HyperLogLog is a probabilistic data structure that can count up to 2^64 distinct elements with a standard error of 0.81%. It is therefore suitable for scenarios where perfect accuracy is not required, such as daily or monthly active-user statistics.
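To build intuition for how a few kilobytes of registers can estimate huge cardinalities, here is a toy HyperLogLog estimator. It is NOT Redis's implementation (Redis adds sparse encoding and bias corrections); it is the textbook raw estimator with 1024 registers, whose expected error is about 1.04/√1024 ≈ 3.25%:

```python
import hashlib

def hll_estimate(items, p=10):
    """Toy raw HyperLogLog estimate with m = 2^p registers (no bias corrections)."""
    m = 1 << p
    M = [0] * m
    for it in items:
        # 64-bit hash of the element
        h = int.from_bytes(hashlib.sha1(str(it).encode()).digest()[:8], "big")
        j = h >> (64 - p)               # first p bits choose the register
        w = h & ((1 << (64 - p)) - 1)   # remaining bits
        rho = (64 - p) - w.bit_length() + 1  # position of leftmost 1-bit (1-based)
        M[j] = max(M[j], rho)
    alpha = 0.7213 / (1 + 1.079 / m)    # standard correction constant for large m
    return alpha * m * m / sum(2.0 ** -r for r in M)

est = hll_estimate(range(100_000))
print(f"estimated ≈ {est:.0f} for 100000 true distinct elements")
```

The state is just 1024 small register values, which is why memory stays flat no matter how many elements are added.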

HyperLogLog estimates the approximate cardinality of a set with a probabilistic algorithm, whose original inspiration is the Bernoulli process.

A Bernoulli process here is a coin-tossing experiment: toss a fair coin, which lands heads or tails with probability 1/2 each, and keep tossing until the first heads appears, recording the number of tosses k. For example, if the first toss comes up heads, k = 1; if the first toss is tails and heads first appears on the third toss, k = 3.

Run n such Bernoulli processes and we get n counts k1, k2, …, kn; call the largest of them k_max.

A mathematical derivation shows that 2^{k_max} can be used as an estimate of n. In other words, we can approximate the number of Bernoulli processes from the maximum number of tosses observed.

Powerful as HyperLogLog is, it is not exact, so it only suits scenarios that tolerate some error. And it cannot report the activity of any individual user: at only about 14 KB, it simply cannot store that much information.

To summarize: for the interview question at the start of the article, both bitmap and HyperLogLog can do the job.

The strengths of bitmaps: well-balanced characteristics, exact statistics, and the status of every individual object is retrievable, with queries returning in about a second. The weakness: when the number of objects is very large it takes noticeable storage, though still within an acceptable range, and sharding or compression can reduce it further.

The strength of HyperLogLog: it can count an almost unimaginable number of elements in an absurdly small amount of memory. The weaknesses: it sacrifices accuracy, and it cannot report the status of individual objects.


Origin blog.csdn.net/liuxingjiaoyu/article/details/112896911