Large-scale concurrency in web systems: e-commerce flash sales and snap-ups

E-commerce flash sales and snap-ups are familiar to all of us. From a technical point of view, however, they are an enormous test for a web system. When a web system receives tens of thousands of requests or more in a single second, optimization and stability become critical. This article focuses on the technical implementation and optimization of flash sales and snap-ups, and along the way explains, from a technical angle, why it is so hard for us to grab train tickets.

1. Challenges brought by large-scale concurrency 

In a past job, I worked on a flash-sale feature that had to handle 50,000 requests per second (50k/s). In the process, the whole web system ran into many problems and challenges. If a web system is not optimized specifically for this kind of load, it easily falls into an abnormal state. Let's discuss the ideas and methods of optimization.

1. Reasonable design of the request interface

A flash-sale or snap-up page is usually divided into two parts: static content such as HTML, and the back-end request interface that actually handles the sale.

Static content such as HTML is usually served through a CDN, so the pressure there is generally small. The real bottleneck is the back-end request interface. This interface must support highly concurrent requests, and just as importantly, it must be as "fast" as possible, returning the result to the user in the shortest possible time. To achieve this, the interface's backing store should ideally operate at memory speed; it is not suitable to hit storage such as MySQL directly. If the business genuinely requires it, write to MySQL asynchronously.
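
The "write asynchronously" idea can be sketched roughly like this: the request path touches only Redis, and a separate worker drains a queue into MySQL. This is a minimal illustration, not the original system's code; the key names, the orders table, and db_conn (e.g. a PyMySQL connection) are assumptions.

```python
# Minimal sketch of "accept in memory, persist asynchronously"
# (assumed key names and schema; not the original system's code).
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def accept_order(user_id, item_id):
    """Fast path: only touch Redis, then return to the user immediately."""
    remaining = r.decr(f"stock:{item_id}")   # atomic in-memory stock counter
    if remaining < 0:
        r.incr(f"stock:{item_id}")           # roll back, the item is sold out
        return False
    # Queue the order for a background worker instead of writing MySQL inline.
    r.rpush("order_queue", json.dumps({"user": user_id, "item": item_id}))
    return True

def persist_worker(db_conn):
    """Slow path: a separate process drains the queue into MySQL."""
    while True:
        _, payload = r.blpop("order_queue")
        order = json.loads(payload)
        with db_conn.cursor() as cur:
            cur.execute(
                "INSERT INTO orders (user_id, item_id) VALUES (%s, %s)",
                (order["user"], order["item"]),
            )
        db_conn.commit()
```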

 

Of course, some flash sales and snap-ups use "delayed feedback": the result is not known at the moment of the sale, and only some time later can the user see whether they succeeded. However, this is a "lazy" approach, the user experience is poor, and users easily suspect a "black box operation".

2. The challenge of high concurrency: it must be "fast"

We usually measure the throughput of a web system in QPS (queries per second, the number of requests processed per second), and this metric is critical in high-concurrency scenarios of tens of thousands of requests per second. For example, assume the average response time for one business request is 100 ms, the system has 20 Apache web servers, and MaxClients is configured to 500 (the maximum number of Apache connection processes).

Then, the theoretical peak QPS of our web system is (ideally calculated):

20 * 500 / 0.1 = 100,000 (100k QPS)

Huh, our system seems very powerful: it can handle 100,000 requests in one second, so the 50k/s flash sale looks like a "paper tiger". Reality, of course, is not so ideal. In a real high-concurrency scenario the machines are under heavy load, and the average response time rises sharply.

As for the web server, the more connection processes Apache opens, the more context switching the CPU must do, which increases CPU consumption and directly drives up the average response time. So the MaxClients value above must be chosen with hardware factors such as CPU and memory in mind; more is not always better. You can test with the ab (ApacheBench) tool that ships with Apache and pick an appropriate value. For storage we choose Redis, which operates at memory speed, because under high concurrency the response time of the storage layer is critical. Network bandwidth is also a factor, but such request packets are generally small and rarely become the bottleneck. Load balancing rarely becomes the system bottleneck either, so I will not discuss it here.

Now the problem arrives. Suppose that under the 50k/s high-concurrency load, our system's average response time rises from 100 ms to 250 ms (in practice it can be even worse):

20 * 500 / 0.25 = 40,000 (40k QPS)

So our system is left with 40k QPS, and facing 50k requests per second, there is a gap of 10,000 in between.
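
The arithmetic above fits in a tiny helper, shown here just to make the back-of-the-envelope model explicit (servers times worker processes per server, divided by average response time):

```python
# Theoretical peak QPS = servers * MaxClients per server / average response time.
def peak_qps(servers, max_clients, avg_response_seconds):
    return servers * max_clients / avg_response_seconds

print(peak_qps(20, 500, 0.10))  # 100000.0 -> the idealised 100k QPS
print(peak_qps(20, 500, 0.25))  # 40000.0  -> only 40k QPS once latency climbs
```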

This is where the real nightmare begins. Picture a highway toll gate: 5 cars arrive every second and 5 cars pass through every second, so the gate works normally. Suddenly only 4 cars can pass per second while the traffic flow stays the same, and a huge jam inevitably forms (5 lanes suddenly become 4).

Likewise, within some second all 20 * 500 available connection processes are working at full capacity, yet 10,000 new requests arrive with no connection process to serve them, and the system, as expected, falls into an abnormal state.

 

In fact, similar situations also occur in ordinary, non-high-concurrency scenarios: one business interface has a problem and responds extremely slowly, the response time of the whole web request stretches out, the web server's available connections are gradually used up, and other, perfectly normal business requests find no connection process available.

An even more frightening problem is user behavior. The less usable the system is, the more frequently users click, and this vicious circle eventually leads to an "avalanche": one web machine goes down, its traffic spills over onto the machines that are still working, makes those go down too, and the cycle repeats until the whole web system is dragged down.

3. Restart and overload protection

If the system "avalanches", rashly restarting the services will not solve the problem; the most common symptom is that they crash again immediately after starting. It is best to block the traffic at the ingress layer first and then restart. If services such as Redis or memcache have also gone down, pay attention to "warming them up" when restarting, which may take quite a while.

In flash-sale and snap-up scenarios the traffic often exceeds anything our system prepared for or imagined, so overload protection is essential. Rejecting requests when a full-load state is detected is itself a protective measure. Filtering on the front end is the easiest approach, but it is the kind of behavior that gets you "pointed at by a thousand fingers" by users. It is more appropriate to put overload protection at the CGI entry layer, so the client's request can be turned away quickly.
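
One way to picture entry-layer overload protection is a cheap gate that rejects requests once too many are already in flight. This is only a sketch under assumptions (the threshold, the in-process counter, and the 503 response are mine, not from the article); a real deployment would do this in the web server or CGI layer itself.

```python
# Sketch of entry-layer overload protection: reject early and cheaply when
# the number of in-flight requests passes a threshold (threshold is assumed).
import threading

MAX_IN_FLIGHT = 4000          # tune to what the backend can actually sustain
_in_flight = 0
_lock = threading.Lock()

def handle_request(process):
    """Wrap the real handler; shed load instead of queueing it."""
    global _in_flight
    with _lock:
        if _in_flight >= MAX_IN_FLIGHT:
            # Fast rejection: a tiny static response, no backend work at all.
            return 503, "Too busy, please retry later"
        _in_flight += 1
    try:
        return process()
    finally:
        with _lock:
            _in_flight -= 1
```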

2. Offense and defense

Flash sales and snap-ups receive "massive" numbers of requests, but much of that volume is inflated. Many users, in order to "grab" the goods, use helper tools such as "ticket-grabbing" software to send as many requests to the server as possible, and some advanced users write powerful automated request scripts. The reasoning is simple: among the requests competing in a flash sale or snap-up, the more requests you send, the higher your probability of success.

These are all ways of "breaking the rules". But where there is offense, there is defense; it is a war without gunpowder.

1. The same account sending multiple requests at once

Some users, via browser plug-ins or other tools, send hundreds of requests or more from their own account the moment the flash sale starts. Such users undermine the fairness of flash sales and snap-ups.

In systems that have not handled data safety properly, such requests can also cause another kind of damage: some conditional checks get bypassed. Take a simple claim flow: first check whether the user already has a participation record; if not, the claim succeeds; finally, write the participation record. It is a very simple piece of logic, but under high concurrency it has a deep flaw. Multiple concurrent requests are distributed by the load balancer to several web servers on the internal network; they all first send a query to storage, and then, in the window before one of them has successfully written the participation record, the others all read back "no participation record". The conditional check is therefore at risk of being bypassed.
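
The flawed flow reads roughly like the following sketch (hypothetical table and column names); the comment marks the window in which a second request for the same user can slip past the check.

```python
# The flawed "check, then write" flow described above (hypothetical schema).
# Two concurrent requests can both see "no participation record" before
# either of them has written one, so both pass the check.
def claim(db_conn, user_id, activity_id):
    with db_conn.cursor() as cur:
        cur.execute(
            "SELECT 1 FROM participation WHERE user_id=%s AND activity_id=%s",
            (user_id, activity_id),
        )
        if cur.fetchone():                # step 1: check
            return "already claimed"
        # <-- another request for the same user can slip in right here
        cur.execute(
            "INSERT INTO participation (user_id, activity_id) VALUES (%s, %s)",
            (user_id, activity_id),
        )                                 # step 2: write
    db_conn.commit()
    return "claim succeeded"
```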

 

 

Countermeasures:

At the program entry point, accept only one request per account and filter out the rest. This not only solves the problem of a single account sending N requests, it also keeps the subsequent logic safe. One way to implement it is to write a flag into an in-memory cache service such as Redis: only one request is allowed to write the flag successfully (for example by combining it with WATCH's optimistic-locking semantics), and whoever writes it may continue to participate.
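
A minimal sketch of the "only one request may write the flag" idea, using Redis SET with NX (the article mentions WATCH, which can provide the same guarantee; the key layout and expiry here are assumptions):

```python
# One account, one request: only the first writer of the per-account flag
# gets through (SET NX sketch; WATCH can give the same "only one write
# succeeds" guarantee).
import redis

r = redis.Redis()

def try_enter(account_id, activity_id):
    key = f"seckill:{activity_id}:entered:{account_id}"   # assumed key layout
    # nx=True: set only if the key does not exist yet; ex: expire with the sale.
    first = r.set(key, 1, nx=True, ex=3600)
    if first:
        return True        # this request may continue into the flash-sale logic
    return False           # duplicate request from the same account, filter it
```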

 

Alternatively, implement a small service of your own that puts requests from the same account into a queue and processes them one at a time.

2. Multiple accounts sending many requests at once

In their early days, many companies placed almost no restrictions on account registration, so it was easy to register a large number of accounts. This gave rise to specialized "studios" that, using automated registration scripts, accumulated huge batches of "zombie accounts", tens of thousands or even hundreds of thousands of them, dedicated to all kinds of bulk manipulation (this is where Weibo's "zombie followers" come from). For example, if Weibo runs a retweet lottery and we sneak tens of thousands of zombie accounts into the retweets, our chance of winning rises dramatically.

Using such accounts in flash sales and snap-ups follows the same logic, for example snapping up iPhones on the official site, or train-ticket scalping.

 

Countermeasures:

This scenario can be handled by monitoring the request frequency of individual IPs. If an IP is found to be requesting at a very high rate, show it a captcha or block its requests outright (a short sketch of the frequency check follows the list below):

 

  1. Show a captcha. The core goal of a captcha is to tell real users apart. That is why the captchas a site shows are sometimes a "wild dance of ghosts" that we can barely read: the point is to make the image hard to recognize automatically, because powerful scripts can run image recognition on the characters and fill the captcha in for themselves. Some quite creative captchas work rather well, for example asking you to answer a simple question or perform a simple action (such as Baidu Tieba's captcha).
  2. Block the IP directly. This is admittedly somewhat crude, because some real users happen to share the same egress IP and may be "hit by friendly fire". But it is simple and effective, and used in the right situations it works very well.
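
As a rough illustration of the frequency check mentioned above, here is a minimal sketch using a Redis counter per IP; the thresholds, window length, and key names are assumptions, not values from the article.

```python
# Per-IP frequency check (sketch; window length and thresholds are assumed).
import redis

r = redis.Redis()

CAPTCHA_THRESHOLD = 20    # requests per window before we challenge the IP
BLOCK_THRESHOLD = 100     # requests per window before we refuse it outright
WINDOW_SECONDS = 60

def check_ip(ip):
    key = f"ip_rate:{ip}"
    count = r.incr(key)                      # count this request
    if count == 1:
        r.expire(key, WINDOW_SECONDS)        # start a fresh window
    if count > BLOCK_THRESHOLD:
        return "block"                       # drop the request
    if count > CAPTCHA_THRESHOLD:
        return "captcha"                     # ask the client to prove it's human
    return "allow"
```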

 

3. Multiple accounts sending requests from different IPs

As the saying goes, "as virtue rises one foot, vice rises ten". Where there is offense there is defense, and it never stops. Once these "studios" discover that you throttle per-IP request frequency, they come up with a "new attack plan" for exactly that situation: constantly changing the IP.

 

Some readers may wonder where these random-IP services come from. Some organizations hold a batch of independent IPs themselves and build them into a random-proxy-IP service, rented out to these "studios" for a fee. A darker variant uses trojans to compromise ordinary users' computers; the trojan does not disrupt the computer's normal operation and does only one thing, forwarding IP packets, so the user's machine becomes a proxy exit. In this way the attackers obtain large numbers of independent IPs and build them into a random-IP service, purely to make money.

Countermeasures:

Frankly, requests in this scenario are already nearly identical to real user behavior, and telling them apart is very hard. Tightening the restrictions further easily "hits" real users by mistake. At this point we can usually only limit such requests by raising the business threshold for participation, or clean them out in advance through "data mining" of account behavior.

Zombie accounts still share some common traits: they often belong to the same number range or are even consecutively numbered, have low activity, low level, incomplete profiles, and so on. Based on these traits we can set an appropriate participation threshold, for example requiring a minimum account level to join the flash sale. Such business-level measures can filter out some of the zombie accounts.

4. Snapping up train tickets

Having read this far, do you see why you cannot grab a train ticket? If you just try to grab tickets honestly, it really is very hard. Through multiple accounts, train-ticket scalpers occupy many of the available tickets, and the more capable scalpers are "a cut above" when it comes to handling captchas.

When advanced scalpers grab tickets, they use real humans to solve the captchas: a relay service displays the captcha image, a real person reads it and types in the real answer, and the answer is returned to the relay software. With this approach, the protection a captcha provides is nullified, and there is currently no good solution.

 

Because train tickets are tied to real-name ID cards, there is also a ticket-transfer maneuver. Roughly, it works like this: a ticket-grabbing tool is started with the buyer's ID number and keeps sending requests; the scalper's own account then refunds its ticket, and the buyer's ID successfully grabs it. When a carriage has no tickets left, not many people keep watching it, and the scalpers' grabbing tools are very powerful, so even if we do see a refunded ticket appear, we are unlikely to out-grab them.

 

In the end, the scalper has smoothly transferred the train ticket to the buyer's ID.

Countermeasures:

There is no good solution. About the only angle worth exploring is "data mining" on account data: these scalper accounts also share common traits, such as frequently grabbing and refunding tickets and being abnormally active around holidays. Identify them through analysis, then handle and screen them further.

3. Data safety under high concurrency

We know that when multiple threads write to the same data there is a "thread safety" problem (code is thread-safe if, when run by multiple threads at once, it produces the same result as when run by a single thread, matching expectations). With a MySQL database, its built-in locking mechanisms solve this well, but MySQL is not recommended in large-scale concurrency scenarios. Flash sales and snap-ups also have another problem, "overselling": if this is not controlled carefully, the system sells more than it has. We have all heard of e-commerce snap-up events where buyers successfully placed orders and the merchant then refused to recognize them and declined to ship. The problem is not necessarily a dishonest merchant; it may be an oversell risk at the technical level of the system.

1. Why overselling happens

Suppose that in a snap-up we have only 100 items in total, and at the last moment 99 have already been consumed, leaving just one. At this point the system receives several concurrent requests, all of which read a consumed count of 99 (i.e. one item remaining), all pass the remaining-stock check, and overselling results (the same pattern as the scenario described earlier in this article).
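
The race can be reproduced with a naive read-check-write flow like the sketch below (key name and numbers are illustrative): two concurrent calls both read "1 remaining", both pass the check, and the item is sold twice.

```python
# Naive read-check-write stock flow that oversells under concurrency
# (hypothetical key name; the numbers match the 100-item example above).
import redis

r = redis.Redis()
r.set("stock:item42", 1)     # only one item left

def naive_buy():
    remaining = int(r.get("stock:item42"))   # two requests both read "1"
    if remaining > 0:                        # both pass the check
        r.set("stock:item42", remaining - 1) # both "succeed": 2 sold, 1 in stock
        return True
    return False
```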

 

In the scenario above, concurrent user B also "snaps up successfully", and one extra person gets the item. Under high concurrency this situation arises very easily.

2. The pessimistic-locking approach

There are many ways to address thread safety; let's start from the direction of "pessimistic locking".

Pessimistic locking means taking a lock while modifying the data, shutting out modifications from other requests; anything that encounters the lock must wait.
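
As a sketch of what pessimistic locking looks like in practice, here is the classic database row-lock pattern with SELECT ... FOR UPDATE (illustrative table and column names; note that the article itself advises against putting MySQL directly in this hot path):

```python
# Classic pessimistic locking with a database row lock (sketch only).
def buy_with_row_lock(db_conn, item_id):
    with db_conn.cursor() as cur:
        # Every competing request blocks here until the current holder commits.
        cur.execute("SELECT stock FROM items WHERE id=%s FOR UPDATE", (item_id,))
        (stock,) = cur.fetchone()
        if stock <= 0:
            db_conn.rollback()
            return False
        cur.execute("UPDATE items SET stock = stock - 1 WHERE id=%s", (item_id,))
    db_conn.commit()
    return True
```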

 

Although this scheme does solve the thread-safety problem, remember that our scenario is "high concurrency". There will be many such modification requests, each waiting for the "lock"; some threads may never get a chance to grab it, so those requests die waiting. At the same time, the sheer number of such requests instantly pushes up the system's average response time, the available connections get exhausted, and the system falls into an abnormal state.

3. The FIFO-queue approach

Fine, let's tweak the scenario a little: put the requests straight into a queue and process them FIFO (first in, first out), so that no request waits forever for a lock. Reading this, doesn't it feel a bit like forcing multithreading back into a single thread?
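
A minimal sketch of the FIFO idea, with a Redis list as the queue and a single consumer so purchases are processed strictly one at a time (key names assumed; the capacity problem is discussed right below):

```python
# FIFO sketch: every purchase request goes into one queue and a single
# worker consumes it, so the writes are effectively serialised.
import json
import redis

r = redis.Redis()

def enqueue_request(user_id, item_id):
    r.rpush("purchase_queue", json.dumps({"user": user_id, "item": item_id}))

def single_worker():
    while True:
        _, payload = r.blpop("purchase_queue")   # strictly first in, first out
        req = json.loads(payload)
        process_purchase(req)                    # only one request at a time

def process_purchase(req):
    ...  # check stock, record the order, etc.
```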

 

So now we have solved the locking problem, with all requests handled through a "first in, first out" queue. But a new problem appears: under high concurrency, the sheer number of requests can "blow up" the queue's memory in an instant, and the system again falls into an abnormal state. Designing an enormous in-memory queue is another option, but the speed at which the system drains the queue simply cannot keep up with the rate at which requests flood in. Requests pile up in the queue, the web system's average response time still degrades badly, and the system still ends up in an abnormal state.

4. The optimistic-locking approach

Now we can discuss "optimistic locking". Compared with pessimistic locking it uses a looser mechanism, most commonly version-numbered updates: every request is allowed to attempt the modification, but each one obtains a version number for the data, and only the request whose version number still matches succeeds in updating; the others are told the snap-up failed. This way we no longer need to worry about queues, although it does increase CPU overhead. On balance, though, it is a rather good solution.

 

Many pieces of software and services support "optimistic locking"; Redis's WATCH command is one example. With this we keep the data safe.
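
A minimal sketch of the WATCH pattern mentioned above, using the standard redis-py pipeline flow (key name assumed): whoever commits first wins, and a request that loses the race is simply told the snap-up failed.

```python
# Optimistic locking with Redis WATCH/MULTI (sketch; key name assumed).
# WATCH aborts the transaction if another client changed the key in between,
# which is the "only the matching version wins" behaviour described above.
import redis

r = redis.Redis()

def optimistic_buy(item_id):
    key = f"stock:{item_id}"
    with r.pipeline() as pipe:
        try:
            pipe.watch(key)                  # remember the current "version"
            stock = pipe.get(key)
            if stock is None or int(stock) <= 0:
                pipe.unwatch()
                return False                 # sold out
            pipe.multi()
            pipe.decr(key)
            pipe.execute()                   # raises WatchError if key changed
            return True
        except redis.WatchError:
            # Another request updated the stock first: report "not grabbed"
            # instead of retrying, exactly as described above.
            return False
```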

4. Summary

The Internet is developing rapidly, and the more users its services have, the more high-concurrency scenarios appear. E-commerce flash sales and snap-ups are two typical high-concurrency Internet scenarios. The concrete technical solutions may vary widely, but the challenges are similar, and so the approaches to solving them share the same spirit.
