A Detailed Look at the Design of Taobao's "Big Seckill" (大秒) System

Reprinted from: https://www.cnblogs.com/jifeng/p/5264268.html?from=timeline&isappinstalled=0

Some numbers:

Do you still remember the Xiaomi flash sale in 2013? Three Xiaomi phone models went on sale with 110,000 units each, all running on the Big Seckill system, and three minutes later the store became the first and fastest flagship store of that Double Eleven. According to log statistics, the front-end system sustained more than 600,000 effective requests per second (QPS) at the Double Eleven peak, while the back-end cache cluster peaked at nearly 20 million requests per second, close to 300,000/s on a single machine; the real write traffic, however, was much smaller. The record for single-item inventory-decrement TPS was set by the Redmi, at 1,500/s.

Hotspot isolation:

The first principle of seckill system design is to isolate this kind of hot data, so that the 1% of requests does not affect the other 99%, and so that the isolated 1% is easier to optimize. For the seckill we applied isolation at several levels:

 

  • Business isolation. The seckill is treated as a dedicated marketing activity that sellers must sign up for separately. From a technical point of view, once sellers have registered, the hot items are known in advance, so we can warm them up before the sale actually starts.

  • System isolation. This is mainly runtime isolation: the seckill traffic is separated from the other 99% through grouped deployment. The seckill also uses a separate domain name so that its requests land on different clusters.

  • Data isolation. Most of the data touched by the seckill is hot data, so a separate cache cluster or MySQL database is used to store it. The goal, again, is that the 0.01% of hot data does not affect the other 99.99%.

Of course, there are many ways to implement isolation. For example, you can distinguish by user, assigning different users different cookies and routing them to different service interfaces at the access layer; you can also apply different rate-limiting policies to different URL paths at the access layer; the service layer can route to different service interfaces; and the data layer can tag the data with special labels. The goal is always to separate the identified hot spots from ordinary requests, as shown in the sketch below.
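
To make the routing idea concrete, here is a minimal sketch (not from the original article) of an application-layer filter that forwards pre-registered hot items to a dedicated seckill path; the parameter name, the item ids, and the forward target are assumptions for illustration.

```java
import java.io.IOException;
import java.util.Set;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

// Hypothetical routing filter (assumes the Servlet 4.0+ API, where init/destroy
// have default implementations). Requests for known hot items are forwarded to
// the isolated seckill path; everything else continues down the normal chain.
public class HotspotRoutingFilter implements Filter {

    // Hot item ids are known in advance because sellers register the activity.
    private static final Set<String> HOT_ITEM_IDS = Set.of("1001", "1002");

    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        String itemId = req.getParameter("itemId");

        if (itemId != null && HOT_ITEM_IDS.contains(itemId)) {
            // Identified hot spot: hand the request to the isolated seckill handler.
            req.getRequestDispatcher("/seckill/detail").forward(req, resp);
        } else {
            // Ordinary request: continue with the normal detail page.
            chain.doFilter(req, resp);
        }
    }
}
```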

Dynamic and static separation:

The principle introduced above works at the system level: isolate. The next step is to split the hot data itself into static and dynamic parts, which is another important principle for high-traffic systems. On how to carry out this static transformation, I previously wrote an article, "Static Architecture Design of High-Access Systems", which describes the static design of Taobao's product detail system in detail; if you are interested, you can find it in "Programmer" magazine. Our Big Seckill system evolved from the product detail system, so it already implements dynamic/static separation, as shown in Figure 1.

In addition, there are the following features:

 

  • The entire page is cached in the user's browser

  • Even a forced refresh of the whole page only goes as far as the CDN

  • The only request that really matters is the click on the "Refresh Treasure" button

In this way, 90% of the static data is cached on the client or the CDN. When the seckill actually starts, the user only needs to click the dedicated "Refresh Treasure" button instead of refreshing the whole page, so only a small amount of valid dynamic data is requested from the server and the large volume of static data never has to be fetched again. Compared with an ordinary detail page, the seckill page carries much less dynamic data, and its performance is more than 3 times better. The "Refresh Treasure" design is therefore a good way to get the latest dynamic data from the server without refreshing the page.

Peak clipping based on time slicing

Anyone familiar with Taobao's seckill system knows that the first version did not include the answer-a-question step; it was added later. An important purpose of the question is to block seckill bots. In 2011, when seckill bots were rampant, the sales failed to achieve the goal of broad participation and marketing, so the question was added to throttle them. With the question in place, placing an order basically takes no less than about 2s, and the proportion of orders placed by bots dropped below 5%. The new answer page is shown in Figure 2.

 

In fact, the questions serve another important function: they stretch out the peak of order requests, from the previous 1 second to roughly 2 to 10 seconds. The request peak is thus sliced over time, and this time-based sharding matters a great deal for handling concurrency on the server side; it relieves a lot of pressure. In addition, because requests now arrive in sequence, the later ones naturally find the item sold out and never reach the final ordering step, so the truly concurrent writes are very limited. This design idea is quite common today, for example in Alipay's "咻一咻" and WeChat's Shake (摇一摇).

 

Besides shaving traffic on the user side with the front-end questions, the server side generally uses locks or queues to control the instantaneous burst of requests; a queue-based sketch is shown below.
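
As a rough illustration (an assumption, not the original implementation): order requests that do not fit into a bounded queue are rejected immediately, while a small worker pool drains the queue at a rate the downstream inventory service can handle.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal sketch of server-side queuing for instantaneous bursts.
public class OrderRequestQueue {

    private final BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(1000);

    public OrderRequestQueue(int workers) {
        for (int i = 0; i < workers; i++) {
            Thread worker = new Thread(() -> {
                while (true) {
                    try {
                        queue.take().run();   // process one order request at a time
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
            worker.setDaemon(true);
            worker.start();
        }
    }

    /** Returns false when the queue is full, i.e. the request is shed. */
    public boolean submit(Runnable orderTask) {
        return queue.offer(orderTask);
    }
}
```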

Layered Data Validation

Layered validation of the data is another essential design principle for high-traffic systems. Layered validation means shaping the huge volume of requests into a "funnel", as shown in Figure 3: at each layer, as many invalid requests as possible are filtered out, and only what reaches the narrow end of the funnel is a valid request. To achieve this, the data must be validated layer by layer. Some guiding principles:

 

  • First separate the data into static and dynamic parts

  • Cache 90% of the data in the client's browser

  • Cache the read data of dynamic requests on the Web tier

  • Do not enforce strong consistency checks on read data

  • Shard the write data sensibly, based on time

  • Protect write requests with rate limiting

  • Enforce strong consistency checks on write data

The seckill system architecture is designed exactly along these lines, as shown in Figure 4.

Put the large volume of static data that needs no validation as close to the user as possible; in the front-end read system, check only basic information, such as whether the user is eligible for the seckill, whether the item status is normal, whether the user answered the question correctly, and whether the seckill has already ended; in the write system, check things like whether the request is illegitimate and whether the marketing equivalents (Taobao gold coins and the like) are sufficient, and verify write-side consistency such as whether there is still stock; finally, the database layer guarantees the ultimate correctness of the data, for example that inventory can never be decremented below zero. A minimal sketch of the write-path end of this funnel is shown below.
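
The following sketch (table and column names are assumed) illustrates the bottom of the funnel on the write path: the upper layers have already filtered most requests, and the conditional UPDATE is the final, strongly consistent guard against negative inventory.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Minimal write-path sketch: the database itself enforces that stock can never
// go below zero, via the condition in the UPDATE statement.
public class InventoryWriter {

    /** Returns true only if one unit was really deducted. */
    public boolean deductOne(Connection conn, long itemId) throws SQLException {
        String sql = "UPDATE seckill_item SET stock = stock - 1 "
                   + "WHERE item_id = ? AND stock > 0";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, itemId);
            // 0 rows affected means the item is already sold out; a request that
            // slipped through the weakly consistent read-side checks is rejected
            // here, keeping the data finally consistent.
            return ps.executeUpdate() == 1;
        }
    }
}
```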

Real-time hotspot discovery:

At its core, the seckill is still a hot-data read problem, and the simplest kind of it, because the business isolation described earlier lets us identify the hot data ahead of time and put protection in place in advance. Hot data identified in advance is relatively easy to handle: analyzing historical transaction records shows which items are popular, and analyzing users' shopping carts shows which items are likely to sell well; these hot spots can all be found beforehand. What is hard is the item we could not foresee that suddenly becomes hot; that requires real-time hot-data analysis. Our current design can discover real-time hot data on the transaction chain within 3 seconds, and each system then protects itself based on what is discovered. The implementation is roughly as follows:

 

  • Build an asynchronous component that collects the hot keys already counted by the various middleware products along the transaction chain, such as Tengine, the Tair cache and HSF (middleware like Tengine and Tair already has its own hot-key statistics modules).

  • Establish a standard for reporting hot spots and for a hot-spot service that can be subscribed to on demand. The main purpose is to exploit the time lag between when the different systems along the transaction chain (detail, cart, trade, promotion, inventory, logistics) are accessed, so that hot spots already discovered upstream can be passed through to downstream systems, letting them protect themselves in advance. For example, at the peak of a big promotion the detail system is the first to know, based on the hot URLs counted by the Tengine module at the access layer.

  • Send the hot data collected from the upstream systems to the hot-spot service platform; downstream systems such as the trade system then know which items are being hit frequently and can apply hot-spot protection, as shown in Figure 5.

The key parts include:

 

  • It is best for the hot-spot service backend to collect the hot-data logs asynchronously: this keeps it generic and avoids interfering with the main flow of the business systems and the middleware.

  • The hot-spot service backend does not replace anything the existing middleware and applications already do; every middleware product and application still has to protect itself. The backend only provides a unified standard and tooling for collecting hot data and exposing a hot-spot subscription service, so that each system's hot data becomes visible.

  • Hot-spot discovery must be real-time (within 3 seconds); a small counting sketch is shown after this list.
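
The sketch below illustrates only the counting side of hot-key discovery, under the stated assumptions; it is not the actual statistics module that Tengine or Tair already provide. A real system would slide or reset the window every few seconds and publish the flagged keys to the hot-spot subscription service.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Minimal sketch: count accesses per key within a short window and flag keys
// that cross a threshold.
public class HotKeyDetector {

    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();
    private final long threshold;

    public HotKeyDetector(long threshold) {
        this.threshold = threshold;
    }

    /** Record one access; returns true once the key has crossed the threshold. */
    public boolean record(String key) {
        LongAdder counter = counters.computeIfAbsent(key, k -> new LongAdder());
        counter.increment();
        return counter.sum() >= threshold;
    }

    /** Call every few seconds (e.g. every 3s) to start a fresh counting window. */
    public void resetWindow() {
        counters.clear();
    }
}
```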

Key techniques and optimizations:

The previous sections covered the principles used in designing high-traffic read systems, but what do you do when all of these measures are in place and heavy traffic still pours in? The seckill system has to solve a few key problems.

 

Optimizing Java for highly concurrent dynamic requests

 

Compared with general-purpose web servers (Nginx or Apache), Java is actually somewhat weaker at handling large numbers of concurrent HTTP requests. For high-traffic web systems we therefore usually apply a static transformation, letting most requests and data be answered directly by the Nginx server or a web proxy (Varnish, Squid, etc.), which also avoids serialization and deserialization, rather than letting those requests reach the Java tier at all; the Java tier then only handles a small volume of dynamic requests. For those requests, a few optimizations are still available:

 

  • Handle requests directly with a Servlet. Bypassing a traditional MVC framework may spare you a pile of complex and largely useless processing logic and save a millisecond or so; of course, this depends on how much you rely on the MVC framework.

  • Write to the output stream directly. Using resp.getOutputStream() instead of resp.getWriter() avoids re-encoding character data that never changes, which also helps performance; and when emitting data, JSON is preferable to a (usually interpreted) template engine for rendering the page. A minimal servlet sketch follows this list.
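
A minimal sketch of the two points above, assuming a hypothetical endpoint and JSON payload: a plain servlet with no MVC framework that writes JSON bytes straight to the output stream.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// No MVC framework, no getWriter(), no template engine: raw JSON bytes only.
public class SeckillDynamicDataServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String itemId = req.getParameter("itemId");
        // In a real system these values would come from the local cache / Tair.
        String json = "{\"itemId\":\"" + itemId + "\",\"stock\":12,\"started\":true}";

        resp.setContentType("application/json;charset=UTF-8");
        byte[] body = json.getBytes(StandardCharsets.UTF_8);
        resp.setContentLength(body.length);
        // Write the bytes directly to the output stream.
        resp.getOutputStream().write(body);
    }
}
```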

 

Highly concurrent reads of the same item

 

You may say this problem is easy to solve: just put the data in the Tair cache. But a centralized Tair cache normally uses consistent hashing to keep its hit rate up, so a given key always lands on the same machine. Even though one of our Tair machines can sustain 300,000 requests per second, that is still far from enough for a hot item at Big Seckill scale. So how do you remove this single-point bottleneck completely? The answer is a LocalCache at the application layer, i.e. caching the item's data on each individual seckill machine. How is that data cached? Again, it is split into static and dynamic:

 

  • Data that does not change by itself, such as the item's title and description, is pushed in full to the seckill machines before the seckill starts and stays cached until it ends.

  • Dynamic data such as inventory is cached with passive expiration for a short time (usually a few seconds); once it expires, the latest value is fetched from the Tair cache again. A cache sketch appears after the next paragraph.

 

You may wonder: for data that is updated as frequently as inventory, won't stale values lead to overselling? This is exactly where the layered-validation principle for reads comes in: the read path is allowed to serve slightly stale data, because the only consequence is that a few order requests for an item that is in fact sold out are mistakenly let through; final consistency is enforced later, when the data is actually written. Balancing availability against consistency in this way solves the highly concurrent read problem.
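
The following sketch shows one way such a per-machine cache with passive expiration could look; the TTL and the loader hook are assumptions, and the loader would typically wrap a Tair client call. Under high concurrency a few threads may reload the same key at once, which is acceptable for a sketch and still caps the load on the central cache at roughly one refresh per machine per TTL.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal per-machine LocalCache with passive expiration.
public class LocalCache<V> {

    private static final class Entry<V> {
        final V value;
        final long expireAt;
        Entry(V value, long expireAt) { this.value = value; this.expireAt = expireAt; }
    }

    private final Map<String, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final Function<String, V> loader;   // e.g. a call to the Tair client

    public LocalCache(long ttlMillis, Function<String, V> loader) {
        this.ttlMillis = ttlMillis;
        this.loader = loader;
    }

    public V get(String key) {
        Entry<V> e = map.get(key);
        if (e == null || e.expireAt < System.currentTimeMillis()) {
            // Expired or missing: passively reload from the central cache.
            V fresh = loader.apply(key);
            map.put(key, new Entry<>(fresh, System.currentTimeMillis() + ttlMillis));
            return fresh;
        }
        return e.value;
    }
}
```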

 

Highly concurrent updates to the same data

 

LocalCache plus layered validation solves the problem of highly concurrent reads, but highly concurrent writes such as inventory decrements cannot be avoided no matter what, and they are the core technical difficulty of the seckill scenario.

 

The same piece of data is necessarily stored as a single row in the database (MySQL), so a large number of threads end up competing for the InnoDB row lock. The higher the concurrency, the more threads wait: TPS drops, response time climbs, and database throughput suffers badly. This raises a further problem: a single hot item can drag down the performance of the whole database, exactly the situation we do not want, where 0.01% of the items affect the other 99.99%. One idea is therefore to follow the first principle again and isolate, putting hot items into a dedicated hot database. That, of course, brings its own maintenance burden (dynamic migration of hot data, a separate database, and so on).

 

Moving hot items into a separate database still does not solve the lock-contention problem. There are two levels at which to queue:

 

  • Queue at the application layer. Set up a queue per item and execute operations in order. This reduces the concurrency with which a single machine hits the same database row, and it also caps the number of database connections one item can occupy, so a hot item cannot hog too many connections. (A per-item queuing sketch follows this list.)

  • Queue at the database layer. The application layer can only queue within one machine, and because there are many application machines this still leaves a lot of concurrency, so a global queue at the database layer is ideal. Taobao's database team developed a patch for MySQL's InnoDB layer that queues concurrent operations on a single row inside the database itself, as shown in Figure 6.
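
Here is a minimal sketch of application-layer queuing per item, under the assumption of one single-threaded executor per item id; it is not the original code. Operations on the same item are serialized on each machine and use at most one database connection per hot item, while different items still proceed in parallel.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Per-item queuing: inventory decrements for the same row run one after another.
public class PerItemOrderQueue {

    private final ConcurrentHashMap<Long, ExecutorService> queues = new ConcurrentHashMap<>();

    public <T> CompletableFuture<T> submit(long itemId, Supplier<T> deductStock) {
        ExecutorService queue = queues.computeIfAbsent(
                itemId, id -> Executors.newSingleThreadExecutor());
        // Tasks for the same item are serialized; different items run in parallel.
        return CompletableFuture.supplyAsync(deductStock, queue);
    }
}
```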

You might ask: queuing still means waiting, so how is it different from lock contention? If you know MySQL well, you know that InnoDB's internal deadlock detection and the switching between MySQL Server and InnoDB cost quite a bit of performance. Taobao's MySQL core team also made other optimizations, such as the COMMIT_ON_SUCCESS and ROLLBACK_ON_FAIL patches: with a hint added to the SQL, the transaction does not have to wait for the application layer to send COMMIT; after the last SQL statement is executed, it commits or rolls back directly based on the TARGET_AFFECT_ROW result, which removes the network wait (about 0.7 ms on average). As far as I know, the Alibaba MySQL team has submitted these patches to the official MySQL project for review.

Thoughts on hot spots during big promotions:

For hot-spot problems, of which the seckill system is the typical example, years of experience boil down to a few general principles: isolation, dynamic/static separation, and layered validation. You have to consider and optimize every step of the full request chain, and besides optimizing the systems for performance, rate limiting and protection are mandatory homework.

 

Beyond the hot-spot problems described above, the Taobao family of systems faces several other kinds of data hot spots:

 

  • Data access hot spots. For example, certain hot items on the Detail page are accessed extremely heavily; even a cache like Tair has bottlenecks of its own, and once the request volume reaches a single machine's limit there is again a hot-spot protection problem. It sometimes looks easy to solve, say, just add rate limiting, but think it through: once a hot key trips one machine's rate-limit threshold, all of the data cached on that machine effectively becomes unusable, the cache is indirectly broken through, requests fall through to the application layer and the database, and you get an avalanche. This class of problem has to be solved together with the specific cache product; one general idea is a LocalCache on the cache's client side, so that when hot data is detected it is cached directly in the client instead of being requested from the cache server.

  • Data update hot spots. Besides the hot-spot isolation and queuing described earlier, there are further scenarios; for example, an item's lastmodifytime field may be updated very frequently. In some scenarios these many SQL statements can be merged, executing only the last one within a given time window, which reduces the number of update operations hitting the database (a small merging sketch follows this list). In addition, automatic migration of hot items could in principle be done at the data routing layer, using the real-time hot-spot discovery described earlier to move hot items out of the ordinary database into the dedicated hot database.
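
As a rough sketch of merging frequent updates (the table and column names are assumed, not from the original article): only the latest pending lastmodifytime per item is kept, and a scheduled task flushes it once per second, so many logical updates collapse into one SQL statement.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Keeps only the newest pending timestamp per item and flushes periodically.
public class LastModifiedTimeMerger {

    private final Map<Long, Long> pending = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public LastModifiedTimeMerger() {
        scheduler.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.SECONDS);
    }

    /** Called on every logical update; overwrites any earlier pending value. */
    public void touch(long itemId, long timestampMillis) {
        pending.put(itemId, timestampMillis);
    }

    private void flush() {
        for (Map.Entry<Long, Long> e : pending.entrySet()) {
            // Only flush if no newer timestamp arrived in the meantime; otherwise
            // the entry stays and is picked up by the next flush.
            if (pending.remove(e.getKey(), e.getValue())) {
                // e.g. UPDATE item SET lastmodifytime = ? WHERE item_id = ?
                executeUpdate(e.getKey(), e.getValue());
            }
        }
    }

    private void executeUpdate(long itemId, long timestampMillis) {
        // Placeholder for the real database call.
    }
}
```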

 

Hot data can also come from indexes built along a particular dimension. For example, in real-time search, review data is associated with items along the item dimension; some hot items have so many reviews that when the search system builds the review index keyed by item ID, it no longer fits in memory. Associating order information along the trade dimension has the same problem. This kind of hot data has to be hashed and reorganized along an extra dimension.

