Shuangsong divorced and Fan Bingbing broke up: the ones who panicked most were the programmers!

On June 27, Weibo went down yet again under three pieces of hot news in a row: the Shuangsong divorce, the death of Wang Baoqiang's mother, and the breakup of Li Chen and Fan Bingbing. Most netizens care about the news itself, leaving comments and sharing their opinions.


[Image: the three topics trending on Weibo]

From a programmer's perspective, professional habit kicks in: the first thing that comes to mind is your own back-end architecture, and how it could withstand the enormous traffic of three hotspots in a single day!


Why use a cache cluster


In fact, when running a cache cluster, the two situations to fear most are hot keys and large values. So what are a hot key and a large value?


Simply put, a hot key is a key in your cache cluster that is suddenly hammered by tens of thousands, or even hundreds of thousands, of concurrent requests.


A large value means that the value stored under some key may be as big as a gigabyte, causing network-related failures whenever that value is queried.


Let's look at the picture below first. Suppose you have a system deployed as a cluster, with a cache cluster behind it. Whether that cache cluster is Redis Cluster, Memcached, or your company's home-grown cache does not matter.


So what does the system use the cache cluster for? Very simple: put data that rarely changes into the cache, and when users query that rarely-changing data in large volumes, serve the reads straight from the cache.


A cache cluster sustains very high concurrency, and cached reads perform extremely well. For example, suppose you receive 20,000 requests per second, 90% of which are reads. Then 18,000 requests per second are reading data that rarely changes, and only 2,000 are writes.


Now, if you kept all this data in the database and sent the full 20,000 requests per second to it for reads and writes, would that be appropriate?


Of course not. If you want a database to carry 20,000 requests per second, then, sorry, you will probably have to shard the databases and tables and set up read/write separation.


For example, you split the data across 3 master databases that carry the 2,000 write requests per second, then hang 3 slaves off each master, so that 9 slaves in total carry the 18,000 read requests per second.


In that case you would need 12 high-spec database servers in total, which is very expensive and completely disproportionate.


Let’s take a look at the picture below to understand this situation:

[Figure: 3 database masters with 9 slaves carrying 20,000 requests per second]

Therefore, we can simply put the rarely-changing data into the cache cluster instead. The cache cluster can run with 2 masters and 2 slaves: the masters take cache writes, and the slaves serve cache reads.


With the performance of a cache cluster, 2 slave nodes can comfortably carry the 18,000 read requests per second, while 3 database masters carry the 2,000 write requests per second plus a small number of remaining reads.


That instantly brings you down to 4 cache machines + 3 database machines = 7 machines. Compared with the previous 12, isn't that a big reduction in resource overhead?


Indeed, the cache is a very important part of a system's architecture. For data that rarely changes yet is read with high concurrency in large volumes, a cache cluster is an excellent way to absorb those reads.
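
To make the read path concrete, here is a minimal cache-aside sketch in Java. Jedis is used only as an example Redis client; the key layout and loadProductFromDb() are hypothetical stand-ins, and in production you would use a connection pool such as JedisPool rather than one shared Jedis instance.

```java
import redis.clients.jedis.Jedis;

// Minimal cache-aside read path: try the cache, fall back to the DB, refill.
public class ProductCache {

    private final Jedis jedis = new Jedis("cache-host", 6379); // assumed address

    public String getProduct(long productId) {
        String key = "product:" + productId;          // hypothetical key layout
        String cached = jedis.get(key);               // 1. try the cache first
        if (cached != null) {
            return cached;                            // cache hit: no DB access
        }
        String fromDb = loadProductFromDb(productId); // 2. cache miss: read DB
        jedis.setex(key, 300, fromDb);                // 3. refill with a 5-minute TTL
        return fromDb;
    }

    private String loadProductFromDb(long productId) {
        return "{\"id\":" + productId + "}";          // stand-in for a real DB query
    }
}
```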


Let's take a look at the following picture to experience this process:

[Figure: 2-master/2-slave cache cluster absorbing 18,000 reads per second, database handling 2,000 writes per second]

Note that all the machine counts and request numbers here are just examples; the point is to convey the idea.


The goal is simply to give some background to readers unfamiliar with caching, so they understand what it means for a cache cluster to carry a system's read requests.


200,000 users simultaneously access a hot cache


With the background covered, we can get to today's topic: the hotspot cache.


Let's make an assumption: you now have 10 cache nodes absorbing a large volume of read requests. Under normal circumstances, the reads should fall evenly across the 10 nodes, right?


Each of these 10 cache nodes carries 10,000 requests per second.


Let's assume further that 20,000 requests per second is the limit of a single node, so you normally cap each node at 10,000 requests, leaving some headroom.


Well, what does the so-called hot cache problem mean? Very simple: suddenly, for whatever reason, a large number of users access the same piece of cached data.


For example, just like yesterday: the Shuangsong divorce, the death of Wang Baoqiang's mother, and the breakup of Li Chen and Fan Bingbing. Wouldn't each of these drive hundreds of thousands of users to read that piece of hot news within a short time?


Assume the three news items above are three cache entries with three cache keys, and all of these keys happen to live on the same cache machine.


Then the moment one of them breaks, say the moment Fan Bingbing posts on Weibo, hundreds of thousands of requests may rush to that machine in an instant.


What happens now? Let's look at the picture below and feel the despair:

[Figure: 200,000 requests per second converging on a single cache machine]

Remember, we just assumed a cache slave node handles at most 20,000 requests per second. (A single cache node might in fact sustain 50,000 to 100,000 reads per second; 20,000 is merely our working assumption.)


So what happens when 200,000 requests per second suddenly land on that machine? Quite simply, the cache machine targeted by the 200,000 requests in the picture above is overloaded and goes down.


And what happens once cache machines start going down? Read requests find the data missing, fetch the original data from the database, and write it into the remaining cache machines.


But the 200,000 requests per second that follow will overwhelm the next cache machine just the same. One after another, the whole cache cluster collapses and the entire system goes down.


Let's look at the picture below and take in this horrifying scene:

[Figure: cache machines going down one after another as the hot traffic shifts to the survivors]

Automatic discovery of cache hotspots based on stream computing


The key point is this: when a hotspot cache suddenly appears, the system must be able to detect it immediately and then achieve millisecond-level automatic load rebalancing.


So first, how do you discover hot cache keys automatically? Understand that when a cache hotspot appears, per-second concurrency is certainly very high; hundreds of thousands or even millions of requests per second are all possible.


At that scale, it is entirely feasible to run real-time statistics on data access counts using stream-computing technology from the big-data field, such as Storm, Spark Streaming, or Flink.


During the real-time counting, once you find that within 1 second a certain piece of data has suddenly been accessed more than, say, 1,000 times, you immediately flag it as hot data, and the discovered hot key can be written into a coordination service such as Zookeeper.


Of course, how your system decides what counts as hot should be based on your own business and empirical thresholds.
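
As an illustration, here is a much-simplified, single-process stand-in for what the Storm/Flink job would compute: count accesses per key in 1-second windows and, past a threshold, publish the key to Zookeeper. The /hot-keys Znode layout and the 1,000-hit threshold are hypothetical; Apache Curator is assumed as the Zookeeper client.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Simplified hot-key detector: per-key counters reset every second; keys that
// cross the threshold are published once under a hypothetical /hot-keys path.
public class HotKeyDetector {

    private static final long HOT_THRESHOLD = 1000;     // >1000 hits within 1s => hot
    private static final String ZK_BASE = "/hot-keys";  // hypothetical Znode layout

    private final Map<String, LongAdder> window = new ConcurrentHashMap<>();
    private final Set<String> published = ConcurrentHashMap.newKeySet();
    private final CuratorFramework zk;

    public HotKeyDetector(String zkConnectString) {
        zk = CuratorFrameworkFactory.newClient(zkConnectString,
                new ExponentialBackoffRetry(1000, 3));
        zk.start();
        // reset the 1-second counting window periodically
        Executors.newSingleThreadScheduledExecutor()
                 .scheduleAtFixedRate(window::clear, 1, 1, TimeUnit.SECONDS);
    }

    // Called for every cache-access event the counting job sees.
    public void record(String key) throws Exception {
        LongAdder counter = window.computeIfAbsent(key, k -> new LongAdder());
        counter.increment();
        if (counter.sum() > HOT_THRESHOLD && published.add(key)) {
            String path = ZK_BASE + "/" + key;
            // Publish the hot key so every app instance can react to it.
            // (A concurrent detector may have created it already; production
            // code would also handle NodeExistsException here.)
            if (zk.checkExists().forPath(path) == null) {
                zk.create().creatingParentsIfNeeded().forPath(path, key.getBytes());
            }
        }
    }
}
```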


Let's take a look at the picture below to see how the whole process goes:

[Figure: the stream-computing job counting accesses in real time and writing hot keys to Zookeeper]

Some will surely ask: when the stream-computing system counts data accesses, doesn't it hit the same problem of a single machine receiving hundreds of thousands of requests per second?


The answer is no. Stream-computing frameworks, and Storm-style systems in particular, can take the events for one and the same piece of data, scatter them across many machines for local counting, and only then merge the partial results on one machine for the global total.


So hundreds of thousands of requests can first be spread over, say, 100 machines, each of which counts just a few thousand accesses of that piece of data.


Then the 100 partial results are merged on a single machine for the global count. That is why counting with stream computing does not itself create a hotspot.

[Figure: partial counts computed on many machines and merged into one global count]
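
In code form, that two-stage idea looks roughly like the toy, single-process illustration below (plain Java, no Storm API): each "worker" produces a small partial count from the events it saw, and one global step merges the partial maps. The method names are ours, not any framework's.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Two-stage counting: local partial counts per worker, then one global merge,
// so no single counter ever absorbs all of the traffic for one hot key.
public class TwoStageCount {

    // Local step: each of the N workers counts only the hits it saw itself.
    static Map<String, Long> localCount(List<String> eventsSeenByThisWorker) {
        Map<String, Long> partial = new HashMap<>();
        for (String key : eventsSeenByThisWorker) {
            partial.merge(key, 1L, Long::sum);
        }
        return partial;
    }

    // Global step: merge the small per-worker maps into the final totals.
    static Map<String, Long> globalMerge(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, hits) -> total.merge(key, hits, Long::sum));
        }
        return total;
    }
}
```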

Automatically loading hot caches into the JVM local cache


Our own system can watch the Znodes that Zookeeper keeps for hot cache keys; the instant one changes, the system notices.


At that point the system layer can immediately load the corresponding data from the database and put it straight into its own in-process local cache.


For this local cache, Ehcache or a plain HashMap both work; it all depends on your business needs.
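
A minimal sketch of the watch-and-load side, assuming the hypothetical /hot-keys layout from the detection sketch above and Apache Curator's PathChildrenCache recipe; loadFromDb() stands in for the real database query, and a ConcurrentHashMap plays the role of the local cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.cache.PathChildrenCache;
import org.apache.curator.framework.recipes.cache.PathChildrenCacheEvent;

// Each application instance watches the hot-key Znodes and, when one appears,
// loads that record from the DB into an in-process map.
public class LocalHotCache {

    private final Map<String, String> localCache = new ConcurrentHashMap<>();

    public LocalHotCache(CuratorFramework zk) throws Exception {
        // Watch the children of /hot-keys; fires on every newly published hot key.
        PathChildrenCache watcher = new PathChildrenCache(zk, "/hot-keys", true);
        watcher.getListenable().addListener((client, event) -> {
            if (event.getType() == PathChildrenCacheEvent.Type.CHILD_ADDED) {
                String key = event.getData().getPath().substring("/hot-keys/".length());
                // Load from the DB and keep a copy in this JVM.
                localCache.put(key, loadFromDb(key));
            }
        });
        watcher.start();
    }

    // Returns null when the key is not hot, i.e. the caller should go
    // to the cache cluster as usual.
    public String getLocal(String key) {
        return localCache.get(key);
    }

    private String loadFromDb(String key) {
        return "...";  // stand-in for the real database query
    }
}
```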


What we are really doing is turning the centralized cache in the cache cluster into a local cache inside every system instance. Each instance cannot hold much data locally, of course.


After all, a typical single instance of such a system runs on something like a 4-core 8 GB machine, leaving little room for a local cache, so that space is best reserved for exactly this kind of hot data. It fits just right.


Now suppose your system layer is deployed across 100 machines. In an instant, all 100 machines hold a local copy of the hot cache.


From then on, reads of the hot data are served straight from each system's local cache; they never touch the cache cluster.


This way, there is no longer any chance of 200,000 reads per second converging on one cache machine for one hot key. Instead, 100 machines each carry a few thousand requests, and those requests return data directly from the machine's local cache. No problem at all.
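
Putting it together, the read path might look like the sketch below: local hot copy first, then the cache cluster, then the database. LocalHotCache is the class from the previous sketch, and the Jedis client is again just an example.

```java
import redis.clients.jedis.Jedis;

// Three-level read path once local hot caching is in place.
public class ReadPath {

    private final LocalHotCache localHotCache; // from the previous sketch
    private final Jedis jedis;                 // example cache-cluster client

    public ReadPath(LocalHotCache localHotCache, Jedis jedis) {
        this.localHotCache = localHotCache;
        this.jedis = jedis;
    }

    public String get(String key) {
        String value = localHotCache.getLocal(key);
        if (value != null) return value;   // 1. hot key: answered in-process
        value = jedis.get(key);
        if (value != null) return value;   // 2. normal key: cache cluster
        return loadFromDb(key);            // 3. miss: fall back to the database
    }

    private String loadFromDb(String key) {
        return "...";  // stand-in for the real database query
    }
}
```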


Let's draw another picture and walk through the process:

[Figure: 100 application instances each serving the hot key from its own local cache]

Rate limiting and circuit breaker protection


Beyond that, inside each system instance you should also add dedicated rate-limiting and circuit-breaker protection for hot-data access.


Within every system instance, add a circuit-breaker mechanism. Suppose the cache cluster can carry at most 40,000 read requests per second, and you have 100 system instances in total.


Then you should limit yourself: each system instance may send the cache cluster at most 400 read requests per second. Once the limit is exceeded, the breaker trips: instead of hitting the cache cluster, the instance immediately returns an empty result, and the user simply refreshes the page a moment later.


By adding rate-limiting and circuit-breaker protection right at the system layer, you keep the cache clusters and database clusters behind it from being crushed.
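
A minimal per-instance guard, using Guava's RateLimiter as one convenient way to express the 400 requests/second budget from the example above. Strictly speaking this is rate limiting with a degrade path rather than a full failure-driven circuit breaker (which a library such as Sentinel or Hystrix would add); all the numbers are the article's example values.

```java
import com.google.common.util.concurrent.RateLimiter;

// Per-instance guard in front of the cache cluster: with 100 instances and a
// 40,000 req/s cluster budget, each instance allows itself 400 req/s.
public class GuardedCacheClient {

    private final RateLimiter limiter = RateLimiter.create(400.0); // 400 permits/s

    public String get(String key) {
        if (!limiter.tryAcquire()) {
            // Over budget: degrade instead of forwarding to the cache cluster.
            return null; // caller renders an empty result and asks the user to retry
        }
        return readFromCacheCluster(key);
    }

    private String readFromCacheCluster(String key) {
        return "...";  // stand-in for the real cache-cluster read
    }
}
```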


Let's look at one more picture together:

[Figure: per-instance rate limiting and circuit breaking in front of the cache cluster]

To sum up


Concretely, should you implement this complex hot-cache optimization architecture in your own system? That depends on whether your system actually faces such scenarios.


If your system does have hotspot cache problems, then by all means implement a hotspot cache architecture along the lines of this article.


But if not, don't over-engineer; your system probably doesn't need anything this complicated.


If you are in the latter camp, just read this article to understand the architectural ideas!



Origin blog.51cto.com/14410880/2550574