Community Favorites Cache Design Refactoring Practice

Original article by Sky · Dewu Technology

1. Background

The community favorites business is a typical read-heavy, write-light scenario. Several core feeds in the community depend on knowing whether a user has favorited a given piece of content. In the early cache design, traffic was modest and no obvious problems surfaced. Recently, however, the monitoring platform and other channels revealed several related issues, so we redesigned the cache to address them and to ensure the performance and stability of the favorites business.

2. Problem Analysis and Diagnosis

2.1 Interface RT is too high

Monitoring showed that the RT of the "has the user favorited this content" interface peaked at around 8 ms. The interface's job is to check whether a single specified user has favorited a batch of content. With a high cache hit rate, its RT should be close to Redis-level RT, roughly 1-2 ms.

(There is a single spike in the chart; that specific case calls for its own analysis and optimization. Here we focus on the overall picture.)

 

2.2 Redis & MySQL access QPS is too high

The monitoring platform showed that the QPS of Redis cache accesses was about 15 times the favorites-query QPS coming from the upstream service, and that peak MySQL query QPS accounted for nearly 37% of upstream traffic. This means the cache hit rate was low and a large share of requests fell through to the database.

The access QPS is shown in the figures below:

Redis traffic

MySQL traffic

Based on the above analysis, we now have a clear entry point for optimization. Let's look at the specific cause, starting with the pseudo-code of the old implementation:

// Check whether the user has favorited each of the specified posts.
func IsLightContent(userId uint64, contentIds []uint64) {
    index := userId % 20
    pipe := redis.GetClient().Pipeline()
    for _, contentId := range contentIds {
        InitCache(userId, contentId)
        // One user set per content, sharded into 20 groups by userId.
        cacheKey := fmt.Sprintf("%s_%d_%d", key, contentId, index)
        pipe.SIsMember(cacheKey, userId)
    }
    pipe.Exec()
    //......
}

// Check whether the cache shard exists; initialize it from the database if not.
func InitCache(userId uint64, contentId uint64) {
    index := userId % 20
    cacheKey := fmt.Sprintf("%s_%d_%d", key, contentId, index)
    ttl, _ := redis.GetClient().TTL(cacheKey)
    if ttl <= 0 { // key does not exist or has no TTL set
        // query from db
        // sql := "select userId from trendFav where userId%20 = index and content_id = contentId"
        // save to redis
    } else {
        redis.GetClient().Expire(cacheKey, 48*time.Hour)
    }
}

 

From the pseudo code above we can clearly see that the method iterates over the content id list and, for each content, queries that content's cached user set to decide whether the current user has favorited it. In other words, the cache is designed along the content dimension, with a 1:N relationship between content and users: for each post, all user ids who favorited it are loaded and cached. To avoid big keys, the code further splits each user set into 20 shards, which multiplies the number of Redis keys again. On top of that, every lookup is preceded by a TTL command to check expiry. As a result, both Redis QPS and the key count are amplified many times over: a single request checking 10 posts, for example, issues 10 TTL calls plus 10 pipelined SISMEMBER calls, i.e. 20 Redis commands for one upstream query, not counting the EXPIRE or MySQL reads on misses.

It is precisely this sharding strategy, combined with the short cache TTL, that keeps the MySQL query QPS high.

3. Solutions

Based on the above analysis and diagnosis, our goal is to reduce the number of Redis operations per interface request and eliminate as much of the amplification as possible. We initially identified two implementation paths:

  • Replace the per-content loop of cache queries with a single batched query
  • Replace the sharded user-set storage with a single key per user

The upstream call passes one user and multiple content ids, so the Redis query must also satisfy this one-to-many relationship. Clearly, the cache should store the set of favorited content keyed by user.
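To make the direction concrete, here is a minimal sketch of the key layout before and after. The key names are illustrative, not the production names; the big-key concern raised by a single user-level key is addressed next.

import "fmt"

// Old layout: content dimension, one Set per (content, user shard).
// Checking M posts for one user touches M different keys, with two commands each (TTL + SISMEMBER).
func oldCacheKey(contentId, userId uint64) string {
    return fmt.Sprintf("fav_%d_%d", contentId, userId%20) // Set of userIds
}

// New layout: user dimension, a single key per user.
// Checking M posts for one user touches exactly one key.
func newCacheKey(userId uint64) string {
    return fmt.Sprintf("user_fav_%d", userId)
}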

If a user has relatively few favorites, we can easily query all of them from the database and put them into the cache. But if a user has a huge number of favorites, a single key may again become a big key, taking us right back to the original problem. We discussed the following two options:

Option 1: The conventional remedies for large data sets are sharding or hot/cold separation

Due to the nature of the business, most content users see in the recommendation feed was published within the last year. We could cache only the user's favorites from the past year, which bounds the number of cached favorites per user; content published more than a year ago would be checked directly in MySQL, and that case is rare. On closer inspection, however, this approach depends on the business side: we would need the content's publish time to decide whether it falls within the cache window, which complicates the logic of the whole interface. The cost outweighs the benefit, so this idea was quickly rejected.

Option 2: Since we cannot rely on a third party, find a property in our own data that identifies the hottest subset, so that the vast majority of queries can be served from it

All we have are content ids, and content ids are pure numbers, so they can be ordered by magnitude. The business mostly queries recent content, so the queried content ids are mostly recent, larger ids. We can therefore sort the user's favorites by content id in descending order and cache the top slice. As long as every queried id is greater than the smallest id in the cache, the cache alone can decide whether the user has favorited the content.

Example:

When initializing the cache, we sort by content id in descending order and take the first 5000 content ids:

  • If the query returns fewer than 5000 rows, all of the user's favorite records are cached, and the minimum cached content id is recorded as 0.
  • If it returns 5000 or more, some records remain uncached, and the minimum cached content id is the 5000th content id.

At query time, we compare the queried content ids against the minimum cached content id. Ids greater than it fall within cache coverage and can be answered from the cache alone; ids smaller than it fall outside coverage and are checked individually in the database. In practice, the out-of-coverage case is rare.

The choice of cache size is particularly important here. Too small, and the hit rate stays low, so more queries fall through to MySQL. Too large, and initialization takes longer and the big-key problem returns. After analyzing online data, 5000 proved to be a good trade-off.
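A minimal sketch of cache initialization under these rules, using go-redis v9. The key name, the reserved min_id hash field, the helper signature, and the SQL are illustrative assumptions, not the production implementation; it relies on the Hash structure and 7-day TTL described below.

import (
    "context"
    "fmt"
    "strconv"
    "time"

    "github.com/redis/go-redis/v9"
)

const favCacheSize = 5000 // trade-off chosen from online data analysis

// initUserFavCache caches the user's favorites, already sorted by content id descending.
// contentIds is assumed to come from a query like:
//   SELECT content_id FROM content_collection WHERE user_id = ? ORDER BY content_id DESC LIMIT 5000
func initUserFavCache(ctx context.Context, rdb *redis.Client, userId uint64, contentIds []uint64) error {
    key := fmt.Sprintf("user_fav_%d", userId) // illustrative key name

    // min_id = 0 means every favorite is cached; otherwise it is the 5000th
    // (smallest cached) content id, the lower bound of cache coverage.
    var minId uint64
    if len(contentIds) >= favCacheSize {
        minId = contentIds[len(contentIds)-1]
    }

    fields := []interface{}{"min_id", strconv.FormatUint(minId, 10)} // reserved field (assumption)
    for _, id := range contentIds {
        fields = append(fields, strconv.FormatUint(id, 10), "1")
    }
    if err := rdb.HSet(ctx, key, fields...).Err(); err != nil {
        return err
    }
    return rdb.Expire(ctx, key, 7*24*time.Hour).Err()
}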

The following is the query cache judgment flow chart:

 

The cache structure is changed from the original Set to a Hash, and the TTL is extended to 7 * 24 hours.

In this way, the previously separate TTL and SISMEMBER commands are merged into a single HMGET command, cutting the number of Redis accesses in half. The gain is considerable.
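Putting this together, here is a sketch of the query path under the same illustrative assumptions (and imports) as the initialization sketch above: a single HMGET fetches the coverage bound and the membership of every requested id in one round trip, and only ids below min_id fall back to MySQL.

func isFavorited(ctx context.Context, rdb *redis.Client, userId uint64, contentIds []uint64) (favorited map[uint64]bool, checkInDB []uint64, err error) {
    key := fmt.Sprintf("user_fav_%d", userId)

    // One HMGET covers both checks: field 0 is the reserved min_id,
    // the remaining fields are the requested content ids.
    fields := make([]string, 0, len(contentIds)+1)
    fields = append(fields, "min_id")
    for _, id := range contentIds {
        fields = append(fields, strconv.FormatUint(id, 10))
    }
    vals, err := rdb.HMGet(ctx, key, fields...).Result()
    if err != nil {
        return nil, nil, err
    }
    if vals[0] == nil { // cache not initialized: initialize first, or fall back to the DB
        return nil, contentIds, nil
    }
    minId, _ := strconv.ParseUint(vals[0].(string), 10, 64)

    favorited = make(map[uint64]bool, len(contentIds))
    for i, id := range contentIds {
        if id < minId { // below cache coverage: check individually in MySQL
            checkInDB = append(checkInDB, id)
            continue
        }
        favorited[id] = vals[i+1] != nil // field present => favorited
    }
    return favorited, checkInDB, nil
}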

4. Optimization results

As of this writing, the optimized favorites feature has been launched, with very good results. All data below covers the 7 days from 4.14 to 4.20; the optimization took effect around 17:00 on 4.15.

4.1 RPC interface RT reduced

1 IsCollectionContent

The RPC interface that checks whether a post has been favorited. Average RT improved by nearly 3x and is now stable.

4.2 Redis load reduction

1 TTL query

Queries a key's remaining TTL, used to decide whether to extend its validity. Its QPS dropped straight to 0.

2 SISMEMBER query

The old favorites cache query, now replaced by HMGET; its QPS dropped to 0.

3 HMGET query

The QPS of the new favorites cache query now tracks the upstream query QPS one-to-one.

4 Redis memory reduction

The new cache uses roughly 3x less memory and 3x fewer keys than the old one.

4.3 MySQL load reduction

1 Fewer SELECT queries on the content_collection table

QPS dropped by a factor of about 24 and now holds at a stable level.

2 Fewer concurrent MySQL connections

The drop in query QPS also cut the number of concurrent connections by about 3x, which in turn reduced the number of waiting connections.

 

5. Summary

Looking back at this analysis and fix, it is easy to see how important good cache design is to a service. A good cache design not only improves performance but also reduces resource usage and raises overall utilization. With the new design, downstream traffic stays roughly in line with upstream traffic, so traffic growth no longer puts heavy pressure on downstream systems, and the service's overall resilience to high concurrency is greatly improved.

 

*Text/Sky

Follow Dewu Tech for new technical articles at 18:30 every Monday, Wednesday, and Friday.
If you found this article helpful, please comment, share, and like~
