How does Redis count 100 million keys?

Preface

Have you used Redis at large scale, or only as a caching tool? Collections in Redis are most commonly used in scenarios like the following:

  1. In the sign-in system, one day corresponds to a series of user sign-in records.

  2. In the e-commerce system, a product corresponds to a series of reviews.

  3. In the dating system, a user corresponds to a series of friends.

A collection in Redis is simply one key that corresponds to a series of values, and that data is often used for statistics, such as:

  1. In the dating system, it is necessary to count the new friends added every day, as well as the mutual friends of two users.

  2. In the e-commerce system, it is necessary to count the latest comments in the comment list.

  3. In the sign-in system, it is necessary to count the number of users who signed in for a consecutive month.

In large-scale Internet applications, the amount of data is huge: millions, tens of millions, or even hundreds of millions of records. E-commerce giant Taobao, social giants WeChat and Weibo, office giant DingTalk: which of them does not have hundreds of millions of users?

Choosing the right collection type for each scenario makes statistics much more convenient.

Aggregate statistics

Aggregate statistics refers to computing a result across multiple collections, such as the intersection, union, or difference of several sets.

When you need aggregate statistics across multiple collections, Set is a good choice: besides holding no duplicate data, Redis provides the corresponding commands for it.

Intersection

In the example above, counting the mutual friends of two users in the dating system is exactly the intersection case of aggregate statistics.

In Redis, the userid can serve as the key and that user's set of friends as the value.

 

To count the mutual friends of two users, you only need the intersection of the two Sets. The command is as follows:

SINTERSTORE userid:new userid:20002 userid:20003

After the above command is executed, the key userid:new stores the intersection of the two sets userid:20002 and userid:20003.
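
To make the semantics concrete, here is a minimal sketch in plain Python that models what SINTERSTORE does. It does not talk to Redis: the dictionary `store`, the helper `sinterstore`, and the friend IDs are all assumptions made up for illustration.

```python
# Plain-Python model of SINTERSTORE; Python sets stand in for Redis Sets.
# The friend IDs are made-up sample data.
store = {
    "userid:20002": {"10001", "10002", "10003"},
    "userid:20003": {"10002", "10003", "10004"},
}


def sinterstore(db, dest, *keys):
    """Store the intersection of the sets at `keys` under `dest`.

    Missing keys are treated as empty sets. Returns the size of the result.
    """
    sets = [db.get(key, set()) for key in keys]
    db[dest] = set.intersection(*sets) if sets else set()
    return len(db[dest])


mutual_count = sinterstore(store, "userid:new", "userid:20002", "userid:20003")
print(mutual_count, sorted(store["userid:new"]))
```

Because the result is written to a new key, the two source sets stay untouched and the mutual-friend list can be read again later without recomputing it.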

Difference

For example, suppose the dating system needs to count the new friends added each day. This requires the difference between the friend sets of two consecutive days. If the friend set on 2020/11/1 is set1 and the friend set on 2020/11/2 is set2, you only need to take the difference of set2 and set1.

How should the keys be designed in this case?

 

The key userid:20201101 records the friend set of the user userid on 2020/11/1.

Taking the difference is very simple. You only need to execute the SDIFFSTORE command, as follows:

SDIFFSTORE  user:new  userid:20201102 userid:20201101  

After the command completes, the set user:new contains the new friends added on 2020/11/2.

Here is a more familiar example. Weibo has a "people you may know" feature, which can be built with a difference: your friends' friends, minus the friends you already have in common, are the people you may know.
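
The order of the keys matters for a difference, which is easy to get wrong. The sketch below models both examples with plain Python sets rather than Redis. All names and sample data are hypothetical, and subtracting the user's own friends (plus the user) is one assumed reading of "people you may know".

```python
# Plain-Python model of a set difference; sample data is made up.
friends_1101 = {"amy", "bob", "carol"}
friends_1102 = {"amy", "bob", "carol", "dave", "erin"}

# Same order as SDIFFSTORE user:new userid:20201102 userid:20201101:
# everything in the first set that is not in the later ones.
new_friends = friends_1102 - friends_1101

# Reversing the operands answers a different question (friends removed).
removed_friends = friends_1101 - friends_1102

# "People you may know": friends of my friends, minus people I already know.
me = "me"
my_friends = {"bob", "carol"}
friends_of = {
    "bob": {"me", "carol", "dave"},
    "carol": {"me", "bob", "erin"},
}
friends_of_friends = set().union(*(friends_of[f] for f in my_friends))
may_know = friends_of_friends - my_friends - {me}

print(sorted(new_friends), sorted(removed_friends), sorted(may_know))
```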

Union

Continuing with the difference example, suppose we need the total set of friends across 2020/11/1 and 2020/11/2. We only need to take the union of the sets for the two days. The command is as follows:

SUNIONSTORE  userid:new userid:20201102 userid:20201101

The new set userid:new now holds the friends added over the two days.
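
One detail worth seeing in code is that a union removes duplicates, so its size is usually smaller than the sum of the two set sizes. This is a plain-Python sketch with made-up data, not a call to Redis.

```python
# Plain-Python model of SUNIONSTORE; sample data is made up.
day_1101 = {"amy", "bob", "carol"}
day_1102 = {"bob", "carol", "dave"}

combined = day_1101 | day_1102          # union, duplicates collapse
naive_total = len(day_1101) + len(day_1102)

print(len(combined), naive_total)
```

Adding the two daily counts would count bob and carol twice, which is why the union is taken before counting.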

Summary

Intersection, union, and difference on Sets are computationally expensive. If the amount of data is large, they may cause Redis to block.

So how can blocking be avoided? Two suggestions:

  1. Dedicate one replica in the Redis cluster to aggregate statistics, so that the work does not block the master or the other replicas.

  2. Hand the data to the client and let the client perform the aggregate statistics.
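
The second suggestion can be sketched as follows. The assumption here is that the client reads one large set in batches (for example with an incremental scan such as SSCAN) instead of pulling it in a single blocking call. The function name and the batches are hypothetical, and no Redis connection is involved.

```python
def intersect_in_batches(batches, other):
    """Intersect a set that arrives in batches with an in-memory set.

    `batches` is any iterable of member lists, standing in for the pages an
    incremental scan would return. `other` is the second set, already loaded.
    """
    result = set()
    for batch in batches:
        result.update(member for member in batch if member in other)
    return result


# Made-up data: the first set arrives in three pages.
pages = [["10001", "10002"], ["10003", "10004"], ["10005"]]
second_set = {"10002", "10004", "10005", "10009"}

print(sorted(intersect_in_batches(pages, second_set)))
```

The server only ever returns small pages, so the expensive set logic runs on the client's CPU rather than inside Redis.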

Sorting statistics

On some e-commerce websites, the product review list always shows the latest reviews first. How is this done?

The latest-reviews list contains all reviews, so it needs a collection that keeps its elements in order, which is called an ordered collection.

Of the four Redis collection types (List, Hash, Set, and Sorted Set), List and Sorted Set are ordered.

But what is the difference between List and Sorted Set? Which one should you use?

List is ordered by the sequence in which elements are inserted, while Sorted Set is ordered by the weight of each element. For example, the weight can be based on the time an element is inserted into the collection: elements inserted earlier have a smaller weight, and elements inserted later have a larger weight.

For this example, both clearly meet the requirement: List supports paging with the LRANGE command, and Sorted Set supports paging with the ZRANGEBYSCORE command.

In terms of flexibility, however, List falls short. A List can only be ordered by insertion. In most scenarios, sorting is not only by time but also by other specific conditions. In those cases Sorted Set is the better fit: you only need to generate the corresponding weights with your own algorithm.
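
The weight-based ordering can be sketched in plain Python. This models a Sorted Set as a dictionary from member to score and mimics paging by score range in the spirit of ZRANGEBYSCORE with an offset and count. The helper names, review IDs, and timestamps are assumptions for illustration.

```python
def zadd(zset, score, member):
    """Insert or update a member with the given score (weight)."""
    zset[member] = score


def zrangebyscore(zset, lo, hi, offset=0, count=None):
    """Return members with lo <= score <= hi, ordered by score then member."""
    ranked = sorted(
        (score, member) for member, score in zset.items() if lo <= score <= hi
    )
    members = [member for _, member in ranked]
    end = None if count is None else offset + count
    return members[offset:end]


# Made-up reviews, using the submission time (Unix seconds) as the weight.
reviews = {}
zadd(reviews, 1604275200, "review:1")
zadd(reviews, 1604278800, "review:2")
zadd(reviews, 1604282400, "review:3")

page = zrangebyscore(reviews, 1604275200, 1604282400, offset=1, count=2)
print(page)
```

Swapping the timestamp for another weight, such as a helpfulness score, changes the ordering without touching the paging logic, which is the flexibility a List cannot offer.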

Binary state statistics

Binary state means a value is either 0 or 1. In the sign-in scenario, only two states need to be recorded, signed in (1) and not signed in (0), which is typical binary state statistics.

Binary state statistics can use Bitmap, an extended data type in Redis. Under the hood it is implemented with the String type, which can be treated as an array of bits. The details will be covered in a follow-up.

In sign-in statistics, each 0 or 1 takes only one bit, so even a full year of sign-in data takes only 365 bits, which greatly reduces storage space.

Bitmap provides the GETBIT and SETBIT operations, which read and write a single bit of the bit array at a given offset. Note that Bitmap offsets start at 0, so the minimum offset is 0. SETBIT writes the value you pass (0 or 1), so writing 1 marks the bit as set. Bitmap also provides the BITCOUNT operation, which counts the number of 1 bits in the bit array.

How should the key be designed? The key can take the form userid:yyyyMM, that is, the unique ID plus the month. Assume the employee ID is 10001 and we need to count the check-in records for 2020/11.

The first step is to set the value. Assuming the employee checked in on November 2, the command is as follows:

SETBIT userid:10001:202011 1 1 

Bitmap offsets start at 0, so the 2nd of the month has offset 1. Setting the value to 1 marks a successful check-in.

The second step is to check whether the user checked in on November 2. The command is as follows:

GETBIT userid:10001:202011 1

The third step is to count the number of check-ins in November. The command is as follows:

BITCOUNT userid:10001:202011
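
The three steps above can be modeled in plain Python to show how the bits are laid out. This sketch assumes the Redis convention that offset 0 is the most significant bit of the first byte. The `Bitmap` class is a hypothetical stand-in, not a Redis client.

```python
class Bitmap:
    """A byte-backed bit array that mimics SETBIT, GETBIT and BITCOUNT."""

    def __init__(self):
        self.data = bytearray()

    def setbit(self, offset, value):
        """Write 0 or 1 at `offset` and return the previous bit value."""
        byte_index, bit_index = divmod(offset, 8)
        if byte_index >= len(self.data):
            # Grow with zero bytes so the offset becomes addressable.
            self.data.extend(b"\x00" * (byte_index + 1 - len(self.data)))
        mask = 1 << (7 - bit_index)  # offset 0 is the highest bit of a byte
        previous = 1 if self.data[byte_index] & mask else 0
        if value:
            self.data[byte_index] |= mask
        else:
            self.data[byte_index] &= ~mask & 0xFF
        return previous

    def getbit(self, offset):
        """Read the bit at `offset`; unset or out-of-range bits read as 0."""
        byte_index, bit_index = divmod(offset, 8)
        if byte_index >= len(self.data):
            return 0
        return (self.data[byte_index] >> (7 - bit_index)) & 1

    def bitcount(self):
        """Count the bits that are set to 1."""
        return sum(bin(byte).count("1") for byte in self.data)


november = Bitmap()          # stands in for userid:10001:202011
november.setbit(1, 1)        # November 2 has offset 1
print(november.getbit(1), november.getbit(0), november.bitcount())
```

A month of check-ins for one user fits in four bytes here, since 30 bits round up to 4 bytes.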

Now for the real question. Suppose you need to count the total number of users who have checked in for 20 consecutive days, and there are 100 million users. How would you handle it?

For example, how would you count the users who checked in every day from 2020/11/01 to 2020/11/20?

Bitmap also supports bitwise AND, OR, and XOR operations across multiple Bitmaps at once. These are provided by the BITOP command, which takes the operation, a destination key, and the source keys.

 

Here is the idea: use each day's date as a key, and store in the corresponding Bitmap the check-in status of the 100 million users for that day, with one bit per user.

 

Then we only need to perform a bitwise AND over the 20 Bitmaps from 2020/11/1 to 2020/11/20. Each bit in the resulting Bitmap represents whether the corresponding user checked in on all 20 days: the bit is 1 only if the user checked in on every one of those days.

 

Finally, use the BITCOUNT command to count the 1 bits in the result.
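
The AND-then-count idea can be sketched in plain Python, using integers as bitmaps. The sample below uses 3 days and 8 users instead of 20 days and 100 million users, purely to keep it readable. The function name and the data are assumptions.

```python
from functools import reduce


def count_checked_in_every_day(daily_bitmaps):
    """AND all daily bitmaps together, then count the remaining 1 bits.

    Each bitmap is an int in which bit i is 1 if user i checked in that day.
    """
    combined = reduce(lambda left, right: left & right, daily_bitmaps)
    return bin(combined).count("1")


# Made-up check-in data for 8 users over 3 days.
days = [
    0b10110111,
    0b10010101,
    0b11010100,
]

print(count_checked_in_every_day(days))
```

A user who misses a single day contributes a 0 to the AND for that bit, so only users present in every daily bitmap survive to the final count.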

You can estimate the memory overhead. A 100-million-bit Bitmap per day takes about 12 MB of memory (10^8/8/1024/1024), so 20 days of such Bitmaps take about 240 MB, which is not much memory pressure. In practice, however, it is best to set an expiration time on each Bitmap so that Redis automatically deletes check-in records that are no longer needed.

Whenever binary state is involved, such as whether a user exists, whether a user has checked in, or whether a product exists, you can use Bitmap to save memory effectively.
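
The estimate above is easy to check. The short calculation below counts only the raw bits and ignores any per-key overhead inside Redis, which is an assumption of this sketch.

```python
def bitmap_megabytes(bits):
    """Size of a bit array in MB: bits -> bytes -> KB -> MB."""
    return bits / 8 / 1024 / 1024


per_day_mb = bitmap_megabytes(10**8)   # one day, 100 million users
twenty_days_mb = per_day_mb * 20       # twenty daily bitmaps

print(round(per_day_mb, 2), round(twenty_days_mb, 1))
```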

Cardinality statistics

Cardinality statistics refers to counting the number of unique elements in a set.

For example, e-commerce websites usually need to count the UV (unique visitors) of each web page to determine its weight. The UV of a page must be deduplicated, and the Redis Set type supports deduplication, so Set is the first thing that comes to mind.

But there is a problem. Under the hood, Set uses hash tables and integer arrays. If the UV of a page reaches tens of millions (and an e-commerce site has far more than one page), the memory consumption becomes extremely large.

Redis provides an extended type, HyperLogLog, for cardinality statistics. Counting the cardinality of up to 2^64 elements requires only about 12 KB of memory.

Exciting, isn't it? But HyperLogLog is approximate, with an error of about 0.81%. If you need exact counts, you still need Set. For page UV, this level of accuracy is good enough.

When counting web page UVs, you only need to store the user's unique ID in HyperLogLog, as follows:

PFADD p1:uv 10001 10002 10003 10004

If there are duplicate elements, they will be automatically removed.

Counting is also very simple. Use the PFCOUNT command, as follows:

PFCOUNT p1:uv
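
To give a feel for how an approximate counter can stay small, here is a textbook-style HyperLogLog sketch in plain Python. It is not Redis's implementation. The class name, the use of SHA-1 as the hash, and the small-range correction are choices made for this illustration. It uses 2^14 = 16384 registers, which at 6 bits each would occupy 12 KB, in line with the figure above.

```python
import hashlib
import math


class TinyHyperLogLog:
    """A simplified HyperLogLog: one small register per bucket."""

    def __init__(self, precision=14):
        self.precision = precision
        self.m = 1 << precision            # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash
        rest_bits = 64 - self.precision
        index = h >> rest_bits                         # first bits pick a register
        rest = h & ((1 << rest_bits) - 1)
        # Rank = position of the leftmost 1 bit in the remaining bits.
        rank = rest_bits - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return round(m * math.log(m / zeros))
        return round(raw)


uv = TinyHyperLogLog()
for _ in range(2):                 # every visitor is added twice
    for user_id in range(10000):
        uv.add(user_id)

estimate = uv.count()
print(estimate)
```

Adding every ID twice does not change the registers, which mirrors how PFADD ignores duplicates. The printed estimate should land close to 10000 but is not expected to match it exactly.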

Summary

This article introduced several types of statistics and which collection should be used to store the data for each. The key points on support, advantages, and disadvantages are as follows:

 

Set and Sorted Set both support the intersection and union aggregation operations, but Sorted Set does not support the difference operation (until Redis 6.2, which added ZDIFF).

Bitmap can also perform AND, OR, and XOR aggregation operations across multiple Bitmaps.

Both List and Sorted Set support sorting statistics, but List is ordered by insertion, while Sorted Set supports weights, which makes it more flexible than List.

For binary state statistics and for checking whether an element exists, Bitmap is recommended to save memory.

For cardinality statistics, HyperLogLog is recommended when the data volume is large and exact results are not required, since it saves memory. For exact cardinality statistics, it is best to use Set.


Origin blog.csdn.net/WXF_Sir/article/details/131449033