Preface
Have you used Redis at scale, or only as a caching tool? Collections in Redis are most commonly used in scenarios like these:
- In a sign-in system, one day corresponds to a series of user sign-in records.
- In an e-commerce system, a product corresponds to a series of reviews.
- In a social networking system, a user corresponds to a series of friends.
A collection in Redis is simply one key that maps to a series of values, and that data is often used for statistics, such as:
- In a social networking system, counting the new friends added each day, as well as the mutual friends of two users.
- In an e-commerce system, finding the latest comments in the comment list.
- In a sign-in system, counting the users who signed in every day for a whole month.
In large-scale Internet applications the data volume is huge: millions, tens of millions, or even hundreds of millions of records. Think of e-commerce giant Taobao, social giants WeChat and Weibo, or office giant DingTalk. Which of them does not have hundreds of millions of users?
Choosing the right collection type for each scenario is what makes these statistics convenient.
Aggregate statistics
Aggregate statistics means computing a result from multiple collections, such as the intersection, union, or difference of several sets.
When you need aggregate statistics over multiple collections, Set is a good choice. Besides holding no duplicate data, it comes with Redis commands for exactly these operations.
Intersection
In the example above, counting the mutual friends of two users in a social networking system is an intersection, one of the aggregate statistics.
In Redis, you can use the userid as the key and the set of that user's friends as the value.
To count the mutual friends of two users, you only need the intersection of their two Sets. The command is as follows:
SINTERSTORE userid:new userid:20002 userid:20003
After the above command is executed, the key userid:new holds the intersection of the two sets userid:20002 and userid:20003.
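The effect of SINTERSTORE can be modelled in plain Python without a Redis server. This is only a minimal sketch, not Redis client code: the dictionary `store` stands in for the Redis keyspace, the function `sinterstore` is a hypothetical helper, and the friend ids are made up for illustration.

```python
# In-memory stand-in for the Redis keyspace: key -> set of friend ids.
# The friend ids below are made-up sample data.
store = {
    "userid:20002": {"10001", "10002", "10003"},
    "userid:20003": {"10002", "10003", "10004"},
}


def sinterstore(dest, *keys):
    """Mimic SINTERSTORE: save the intersection of `keys` under `dest`
    and return the number of members in the result."""
    sets = [store.get(key, set()) for key in keys]
    result = set.intersection(*sets) if sets else set()
    store[dest] = result
    return len(result)


count = sinterstore("userid:new", "userid:20002", "userid:20003")
print(count, sorted(store["userid:new"]))
```

A missing key is treated as an empty set here, which mirrors how Redis treats a non-existent key in set operations.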
Difference
For example, suppose the social networking system needs to count the new friends added each day. To do that, take the difference of the friend sets of two consecutive days. Say the friend set on 2020/11/1 is set1 and the friend set on 2020/11/2 is set2. You only need the difference of set2 and set1, that is, the members of set2 that are not in set1.
How should the key structure be designed here? Use one key per user per day: the key userid:20201101 records the friend set of the user userid on 2020/11/1.
Taking the difference is simple. You only need to execute the SDIFFSTORE command, as follows:
SDIFFSTORE user:new userid:20201102 userid:20201101
After the execution is completed, the set user:new contains the new friends added on 2020/11/2.
Here is a more familiar example. Weibo has a "people you may know" feature. It can be built with a difference: your friends' friends, minus the friends you already share with them, are the people you may know.
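Both uses of the difference can be sketched with Python's built-in sets. This is an illustrative sketch only: the function names, the sample friend ids, and the `graph` dictionary are assumptions made for the example, and removing the user themselves from the suggestions is an extra step the text does not mention.

```python
def new_friends(today, yesterday):
    """Friends present today but not yesterday (what SDIFFSTORE computes)."""
    return today - yesterday


def people_you_may_know(me, friends_of):
    """Friends of my friends, minus the people I already know and myself."""
    my_friends = friends_of.get(me, set())
    candidates = set()
    for friend in my_friends:
        candidates |= friends_of.get(friend, set())
    return candidates - my_friends - {me}


# Made-up sample data for one user on two consecutive days.
friends_20201101 = {"a", "b"}
friends_20201102 = {"a", "b", "c", "d"}
added = new_friends(friends_20201102, friends_20201101)

# Made-up friend graph: key is a user, value is that user's friend set.
graph = {
    "me": {"a", "b"},
    "a": {"me", "b", "x"},
    "b": {"me", "a", "y"},
}
suggestions = people_you_may_know("me", graph)
print(sorted(added), sorted(suggestions))
```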
Union
Continuing the difference example, suppose we need all the new friends added over 2020/11/01 and 2020/11/2. We only need the union of the new-friend sets of the two days (in the command below, the two daily keys are assumed to hold each day's new friends). The command is as follows:
SUNIONSTORE userid:new userid:20201102 userid:20201101
The new set userid:new now holds the friends added over the two days.
Summary
Aggregate operations on Sets, such as intersection and union, are computationally expensive. If the amount of data is large, they may block Redis.
So how can blocking be avoided? Two suggestions:
- Dedicate one replica in the Redis cluster to aggregate statistics, so that the computation does not block the master or the other replicas.
- Return the data to the client and let the client compute the aggregate statistics.
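The second suggestion can be sketched as follows. The sketch assumes the client has already fetched the members of each set (for example with SMEMBERS or SSCAN) and only shows the aggregation step that then runs on the client instead of inside Redis. The function name `client_side_aggregate` and its `op` values are hypothetical.

```python
from typing import Iterable, List, Set


def client_side_aggregate(member_lists: List[Iterable[str]], op: str) -> Set[str]:
    """Aggregate already-fetched set members on the client.

    `op` is one of "inter", "union" or "diff"; "diff" subtracts every
    later collection from the first one, like SDIFF does.
    """
    sets = [set(members) for members in member_lists]
    if not sets:
        return set()
    first, rest = sets[0], sets[1:]
    if op == "inter":
        return first.intersection(*rest)
    if op == "union":
        return first.union(*rest)
    if op == "diff":
        return first.difference(*rest)
    raise ValueError(f"unknown operation: {op}")


# Made-up members, as they might come back from two separate reads.
day2 = ["a", "b", "c", "d"]
day1 = ["a", "b"]
print(sorted(client_side_aggregate([day2, day1], "diff")))
```

The trade-off is network traffic: every member travels to the client, so this suits cases where protecting the Redis main thread matters more than bandwidth.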
Sorting statistics
On some e-commerce websites, product reviews are always shown with the latest first. How is this done?
The latest-comments list contains all comments, so it needs a collection that keeps its elements in order. Such a collection is called an ordered collection.
Of the four collection types in Redis, List and Sorted Set are the ordered ones.
But what is the difference between List and Sorted Set? Which one should you use?
List is ordered by the order in which elements are inserted, while Sorted Set is ordered by the weight of each element. For example, the weight can be based on the time an element is inserted: elements inserted earlier get a smaller weight and elements inserted later get a larger one.
For this example, both clearly meet the requirement: List has the paging query command LRANGE, and Sorted Set has the paging query command ZRANGEBYSCORE.
But in terms of flexibility, List falls short. A List can only be ordered by insertion. In most scenarios, though, sorting is not only by time but also by other specific conditions. In that case Sorted Set is a very good fit: you just need to generate the corresponding weights with your own algorithm.
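To make the idea of weights concrete, here is a tiny in-memory model of a Sorted Set. It is a sketch under stated assumptions, not how Redis implements the type: the class `SortedSetModel`, its method signatures, and the timestamps used as weights are all made up for illustration.

```python
import bisect


class SortedSetModel:
    """Tiny in-memory model of a Sorted Set: members ordered by weight."""

    def __init__(self):
        self._scores = {}   # member -> current weight
        self._entries = []  # list of (weight, member), kept sorted

    def zadd(self, score, member):
        """Insert a member, or move it if it already has a weight."""
        old = self._scores.get(member)
        if old is not None:
            self._entries.remove((old, member))
        self._scores[member] = score
        bisect.insort(self._entries, (score, member))

    def zrangebyscore(self, lo, hi, offset=0, count=None):
        """Members with lo <= weight <= hi in ascending order, with paging."""
        hits = [member for score, member in self._entries if lo <= score <= hi]
        return hits[offset:] if count is None else hits[offset:offset + count]


comments = SortedSetModel()
comments.zadd(1604200000, "comment:1")  # weights here are Unix timestamps
comments.zadd(1604203600, "comment:2")
comments.zadd(1604207200, "comment:3")
page = comments.zrangebyscore(1604203600, 1604207200, offset=0, count=10)

# Changing a weight reorders the member, which a List cannot do.
comments.zadd(1604300000, "comment:1")
everything = comments.zrangebyscore(0, float("inf"))
print(page, everything)
```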
Binary state statistics
Binary state means a value is either 0 or 1. In a sign-in or clock-in scenario, only two states need to be recorded: signed in (1) and not signed in (0). This is a typical case of binary state statistics.
Binary state statistics can use Bitmap, an extended data type in Redis. Under the hood it is implemented with the String type and can be regarded as a bit array. The details will be covered in a later article.
In sign-in statistics, each 0 or 1 takes up only one bit, so even a full year of sign-in data is only 365 bits. This greatly reduces storage space.
Bitmap provides the GETBIT and SETBIT operations, which read and write a single bit of the bit array at a given offset. Note that Bitmap offsets start from 0, so the minimum offset is 0. SETBIT sets the bit at that offset to the value you pass, 0 or 1. Bitmap also provides the BITCOUNT operation, which counts the number of 1 bits in the bit array.
How should the key be designed? The key can be userid:yyyyMM, that is, the unique id plus the month. Assume that the employee id is 10001 and we need to count the check-in records for 2020/11.
The first step is to set the value. Assuming that the user clocked in on November 2, the command is as follows:
SETBIT userid:10001:202011 1 1
Bitmap offsets start from 0, so the 2nd of the month has offset 1. Setting the value to 1 means the check-in succeeded.
The second step is to check whether the user clocked in on November 2. The command is as follows:
GETBIT userid:10001:202011 1
The third step is to count the number of punch-ins in November. The command is as follows:
BITCOUNT userid:10001:202011
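The three steps can be modelled with a small bit array in Python. This is a sketch for building intuition, not Redis client code: the class `BitmapModel` is hypothetical, and it assumes offset 0 maps to the most significant bit of the first byte, which matches how Redis lays out a Bitmap inside a String.

```python
class BitmapModel:
    """Minimal in-memory model of SETBIT, GETBIT and BITCOUNT."""

    def __init__(self):
        self._buf = bytearray()

    def setbit(self, offset, value):
        """Set the bit at `offset` to 0 or 1 and return its previous value."""
        byte_index, bit_index = divmod(offset, 8)
        if byte_index >= len(self._buf):
            # Grow the array with zero bytes, as Redis does for the String.
            self._buf.extend(b"\x00" * (byte_index + 1 - len(self._buf)))
        mask = 0x80 >> bit_index  # offset 0 is the most significant bit
        previous = 1 if self._buf[byte_index] & mask else 0
        if value:
            self._buf[byte_index] |= mask
        else:
            self._buf[byte_index] &= ~mask & 0xFF
        return previous

    def getbit(self, offset):
        """Return the bit at `offset`; bits beyond the end read as 0."""
        byte_index, bit_index = divmod(offset, 8)
        if byte_index >= len(self._buf):
            return 0
        return 1 if self._buf[byte_index] & (0x80 >> bit_index) else 0

    def bitcount(self):
        """Count the bits that are set to 1."""
        return sum(bin(byte).count("1") for byte in self._buf)


november = BitmapModel()       # stands in for userid:10001:202011
november.setbit(2 - 1, 1)      # November 2 -> offset 1
print(november.getbit(1), november.getbit(0), november.bitcount())
```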
So here comes the question. Your check-in system needs to count the users who have checked in for 20 consecutive days. How do you handle that? Assume there are 100 million users.
For example, how do you count the users who checked in every day from 2020/11/01 to 2020/11/20?
Bitmap also supports bitwise AND, OR, and XOR across multiple Bitmaps at the same time, through the BITOP command.
Here is the idea: use each day's date as a key, with a Bitmap as the value that stores the check-in status of all 100 million users on that day, one bit per user.
Then we only need a bitwise AND across the Bitmaps from 2020/11/1 to 2020/11/20. Each bit in the resulting Bitmap tells whether the corresponding user checked in on all 20 days: the bit is 1 only if that user checked in on every one of the 20 days.
Finally, you can use the BITCOUNT command on the result to get the count.
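The AND-then-count idea can be checked on a tiny example. The sketch below uses Python integers as stand-ins for the daily Bitmaps, with bit i representing user i. The helper names and the check-in pattern of the five sample users are made up for illustration, and the bit order differs from Redis, which does not affect the count.

```python
from functools import reduce


def bitmap_from_users(user_offsets):
    """Build one day's bitmap as an int: bit i is 1 if user i checked in."""
    bits = 0
    for offset in user_offsets:
        bits |= 1 << offset
    return bits


def users_checked_in_every_day(daily_bitmaps):
    """AND all daily bitmaps together, then count the 1 bits."""
    combined = reduce(lambda a, b: a & b, daily_bitmaps)
    return bin(combined).count("1"), combined


# Made-up check-ins for days 1..20:
# users 0 and 3 never miss a day, user 1 misses day 5,
# user 2 only checks in on even days, user 4 never checks in.
days = []
for day in range(1, 21):
    present = {0, 3}
    if day != 5:
        present.add(1)
    if day % 2 == 0:
        present.add(2)
    days.append(bitmap_from_users(present))

count, combined = users_checked_in_every_day(days)
print(count, bin(combined))
```

Missing a single day is enough to clear a user's bit in the result, which is exactly the "20 consecutive days" condition.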
You can estimate the memory overhead. A 100-million-bit Bitmap per day takes up approximately 12MB of memory (10^8/8/1024/1024), so the Bitmaps for 20 days take about 240MB. The memory pressure is not too great. In practice, however, it is best to set an expiration time on each Bitmap and let Redis automatically delete check-in records that are no longer needed, to save memory.
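The arithmetic behind those figures can be reproduced in a few lines. The variable names are only illustrative, and the calculation counts the raw bits alone, ignoring any per-key overhead Redis adds.

```python
USERS = 100_000_000  # one bit per user per day
DAYS = 20

bytes_per_day = USERS / 8
mb_per_day = bytes_per_day / 1024 / 1024
mb_total = mb_per_day * DAYS

print(f"{mb_per_day:.2f} MB per day, {mb_total:.1f} MB for {DAYS} days")
```

The exact values are a little under the rounded figures in the text: about 11.92MB per day and about 238.4MB for 20 days. The 240MB figure comes from rounding to 12MB first and then multiplying by 20.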
Whenever a binary state is involved, such as whether a user exists, whether a user has checked in, or whether a product exists, you can use Bitmap to save memory effectively.
Cardinality statistics
Cardinality statistics refers to counting the number of unique elements in a set.
For example, e-commerce websites usually need to count the UV (unique visitors) of each web page to determine its weight. The UV of a page must be deduplicated. The Redis Set type supports deduplication, so Set is the first thing that comes to mind.
But there is a problem. Under the hood, Set uses hash tables and integer arrays. If the UV of a page reaches tens of millions (and an e-commerce website has far more than one such page), the memory consumption becomes very large.
Redis provides an extended type, HyperLogLog, for cardinality statistics. A HyperLogLog needs only about 12KB of memory to estimate the cardinality of up to 2^64 elements.
Isn't that exciting? But HyperLogLog is approximate, with a standard error of about 0.81%. If you need exact statistics, you still have to use Set. For something like page UV, this accuracy is enough.
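For readers curious where 12KB and 0.81% come from, the sketch below reproduces both numbers. It relies on details that are not stated in this article and should be read as assumptions: that the Redis HyperLogLog uses 16384 registers of 6 bits each in its dense encoding, and that the standard error of the algorithm is 1.04 divided by the square root of the register count.

```python
import math

# Assumed parameters of the Redis HyperLogLog (not stated in this article).
REGISTERS = 16384
BITS_PER_REGISTER = 6

standard_error = 1.04 / math.sqrt(REGISTERS)
dense_size_bytes = REGISTERS * BITS_PER_REGISTER // 8


def typical_error_band(true_count):
    """Range within one standard error around the true cardinality."""
    delta = true_count * standard_error
    return true_count - delta, true_count + delta


low, high = typical_error_band(10_000_000)
print(f"standard error: {standard_error:.4%}")
print(f"dense size: {dense_size_bytes} bytes")
print(f"10 million UV is typically reported between {low:.0f} and {high:.0f}")
```

For a page with 10 million unique visitors, an error of this size is roughly 81 thousand visitors either way, which is usually acceptable for ranking pages by UV.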
When counting web page UVs, you only need to store the user's unique ID in HyperLogLog, as follows:
PFADD p1:uv 10001 10002 10003 10004
Duplicate elements are counted only once.
Counting is just as simple. Use the PFCOUNT command, as follows:
PFCOUNT p1:uv
Summary
This article introduced several types of statistics and which collections to use for each. To make this easier to follow, the author has summarized the support status, advantages, and disadvantages in a table, as shown below:
Set and Sorted Set both support the aggregate operations of intersection and union, but Sorted Set has no difference operation before Redis 6.2, which added ZDIFF.
Bitmap can also perform AND, OR, and XOR aggregate operations across multiple Bitmaps.
List and Sorted Set both support sorting statistics, but List is ordered by insertion, while Sorted Set supports weights and is therefore more flexible.
For binary state statistics and for checking whether an element exists, Bitmap is recommended to save memory.
For cardinality statistics, HyperLogLog is recommended when the data volume is large and exact precision is not required, because it saves memory. For exact cardinality statistics, it is best to use Set.