2019 Taobao: A Case Study of OceanBase Distributed Load Balancing

Abstract: The Heroku incident raised a question: are there success stories where a load-balancing problem was found in testing and properly resolved? This article digs out one such story: before Taobao's Double Eleven, stress testing revealed that OceanBase's random access policy caused serious load imbalance, which was resolved with a weighted algorithm.

CSDN's cloud-computing channel recently published "Response times of up to six seconds: Heroku modified routing without users' permission, causing high costs". Netizens saw it as a load-balancing failure caused by random scheduling combined with single-threaded Rails request handling adding latency. But are there success stories where such problems were caught in load testing and properly resolved? In a follow-up on Weibo, Alipay's @Leverly commented: "Before last year's Double Eleven, stress testing of OB uncovered a serious load-imbalance problem caused by random access; fortunately it was well solved with a weighting algorithm." That comment drew our attention and prompted this article, which focuses on the experience of OceanBase's distributed load balancing behind Taobao's Double Eleven.

Cloud computing, with its promise of low cost, high performance, high availability, and high scalability, matches the needs Internet applications increasingly face, and has become a hot topic in the Internet field in recent years. Anyone familiar with the underlying architecture of cloud computing knows that distributed storage is an indispensable part of it. Well-known Internet companies abroad such as Google, Amazon, Facebook, Microsoft, and Yahoo have all launched their own distributed storage systems; in China, Taobao developed OceanBase, a high-performance distributed database system supporting massive data, which handles hundreds of billions of records and cross-row, cross-table transactions over hundreds of TB of data [1].

Distributed systems are subject to the well-known "short-board theory" (the weakest link) [2]: if a cluster suffers from load imbalance, the most heavily loaded machine tends to become the system's bottleneck, the short board that drags down overall performance. To prevent this, a dynamic load-balancing mechanism is needed to maximize real-time resource utilization and improve overall system throughput.

Drawing on OceanBase's production use last year, this article shares a load-balancing case encountered during Taobao's early preparation for Double Eleven, in the hope of inspiring everyone's work.

OceanBase architecture overview

OceanBase is a self-developed distributed storage system composed of four kinds of nodes: the central node RootServer, the static-data nodes (ChunkServer), the dynamic-data node (UpdateServer), and the data-merging nodes (MergeServer) [1], as shown in the figure below.

[Figure: OceanBase architecture]

Tablet: a data shard, the basic storage unit; a Table typically consists of multiple Tablets;

RootServer: responsible for managing the machines in the cluster, Tablet location, data load balancing, and metadata such as the Schema;

UpdateServer: responsible for storing dynamically updated data, with memory and SSD as storage media, and for serving external writes;

ChunkServer: responsible for storing static Tablet data, with ordinary disk or SSD as storage media;

MergeServer: responsible for merging the data of the multiple Tablets involved in a query, and for serving external reads;

Within one cluster, multiple replicas of each Tablet are stored on different ChunkServers, each ChunkServer is responsible for a portion of the sharded data, and MergeServer and ChunkServer are usually deployed together.

Preparing for Double Eleven

For most Taobao applications, Double Eleven is the annual stress test of the production line. With traffic repeatedly hitting record highs, it poses a great challenge to the scalability of every system. Estimating capacity and expanding each product line that might become a traffic bottleneck is therefore an urgent pre-Double Eleven task. In the case shared in this article, the application estimated from historical data a peak read load of 70k QPS, about 5-6 times normal, for a total of 5.6 billion read requests per day. At the time, the OceanBase cluster consisted of 36 servers, stored 20 billion rows of data in total, and supported 24 million read requests per day.

Since the cluster's read performance could not meet the demand, we first expanded it, bringing 10 more ChunkServer/MergeServer servers online. Because OceanBase itself is highly scalable, adding machines to a cluster is a very simple operation. After the new machines register with the central RootServer, it starts the Rebalance function, which migrates static data in units of Tablets until the data shards are evenly distributed across all ChunkServers, as illustrated below.

[Figures: Tablet data migration during Rebalance]

After the expansion, we replayed online traffic against the cluster for stress testing, to verify whether the current cluster's performance could meet the application's Double Eleven needs. We used 10 servers running 2,000-4,000 concurrent threads in total to replay online read traffic. Soon after the cluster's overall QPS reached 40k, a large number of stress-test clients timed out and the average response latency exceeded the 100 ms threshold; no matter how we adjusted the load, the system's QPS would not rise. Observing the load across the cluster's machines, we found that only a very few servers were heavily loaded, at about four times the others, while the remaining machines were basically idle; CPU, network, and disk IO all showed serious imbalance.

The load imbalance meant that overall throughput was capped by the most heavily loaded server, a typical case of the "short-board theory" mentioned earlier.

Tracking down the load imbalance

After a client connects, a read request flows through OceanBase as shown below:

[Figure: OceanBase read request flow]

1. The Client obtains the list of MergeServers from the RootServer;

2. The Client sends the request to one of the MergeServers;

3. The MergeServer obtains from the RootServer the locations of the ChunkServers relevant to the request;

4. The MergeServer splits the request by Tablet into multiple sub-requests and sends them to the corresponding ChunkServers;

5. Each ChunkServer requests the latest dynamic data from the UpdateServer and merges it with its static data;

6. The MergeServer merges the data from all sub-requests and returns the result to the Client.

Although OceanBase's read path looks complex, the two RootServer interactions in steps 1 and 3 (Client to RootServer and MergeServer to RootServer) are in practice avoided through caching, which both improves efficiency and greatly reduces the RootServer's load.
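The caching idea can be sketched roughly as follows. This is a minimal Python illustration; the class and method names are invented for this sketch and are not OceanBase's actual interfaces:

```python
class LocationCache:
    """Caches RootServer lookups so steps 1 and 3 are usually served locally."""

    def __init__(self, root_server):
        self.root = root_server
        self.merge_servers = None    # cached MergeServer list (step 1)
        self.tablet_locations = {}   # tablet_id -> ChunkServer replicas (step 3)

    def get_merge_servers(self):
        if self.merge_servers is None:  # only contact RootServer on a miss
            self.merge_servers = self.root.list_merge_servers()
        return self.merge_servers

    def locate_tablet(self, tablet_id):
        if tablet_id not in self.tablet_locations:
            self.tablet_locations[tablet_id] = self.root.locate(tablet_id)
        return self.tablet_locations[tablet_id]

    def invalidate(self):
        """Drop cached entries when a cached server turns out to be stale."""
        self.merge_servers = None
        self.tablet_locations.clear()
```

Since cluster membership and Tablet placement change rarely relative to read traffic, almost every request hits the cache and the RootServer stays off the hot path.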

Analyzing this flow, there are two places where imbalance can arise. In step 2, if the client's choice of MergeServer is unbalanced, some MergeServer machine can become overloaded. In step 4, when the MergeServer sends sub-requests to the ChunkServers holding the data, each Tablet has multiple replicas, and an unbalanced replica-selection policy can likewise overload a ChunkServer. Since each machine in the cluster runs both a ChunkServer and a MergeServer, it is not trivial to tell which module is overloaded. By examining the monitoring information exposed by OceanBase's internal modules, such as QPS, cache hit rate, and disk IO counts, we found that the imbalance was caused by the second scheduling problem: unbalanced access from MergeServers to ChunkServers was overloading some ChunkServers.

ChunkServers store the static Tablet shard data. Possible causes of their load imbalance include:

Data imbalance: the distribution of data sizes across ChunkServers is uneven, for example because some nodes store more Tablet shard data than others;

Traffic imbalance: even when the data is basically balanced, traffic can still be unbalanced, for example because some nodes hold data hotspots.

By collecting statistics over the location metadata of all Tablet shards managed by the RootServer, we found that the amount of Tablet data on each ChunkServer differed little. This also showed that the cluster's Rebalance after the new servers joined had been effective. (We later did find data-imbalance problems in clusters of other applications, which this article does not cover.)

With data imbalance ruled out, traffic imbalance still had several possible causes:

Access hotspots: hot data, such as best-selling items, can make the ChunkServers holding it access hotspots, causing load imbalance;

Large variance among requests: system load is proportional to the CPU, memory, and disk IO consumed in processing requests, and resource consumption is generally proportional to the amount of data processed; even without data-access hotspots, a few large users could therefore still make the load unbalanced.

The analysis above at least established that the ChunkServer traffic imbalance was closely tied to step 4, and the Tablet replica-selection policy in use at the time was random. Generally, a randomized load-balancing strategy is simple, efficient, and stateless. Analyzing the characteristics of the business scenario, the proportion of hot data was not very high, and counting accesses per Tablet on the ChunkServers showed no extreme "hotspots" either; the distribution was roughly normal.

So although hot Tablets are accessed somewhat more often and each contributes relatively more to the load, they make up only a small fraction of all Tablets; conversely, the combined load contribution of all non-hot Tablets is still high. The situation resembles the "long tail effect" [3].

负载均衡算法设计

If the non-hot Tablets on a hot ChunkServer are dispatched to other servers, the traffic imbalance is eased. We therefore designed a new load-balancing algorithm: keep a real-time count of accesses to all Tablets on each ChunkServer as its Ticket, and have every request to read a Tablet replica choose the ChunkServer with the lowest Ticket.
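In sketch form, replica selection under this policy might look like the following. This is an illustrative Python sketch, not OceanBase's actual code; it assumes a per-ChunkServer counter map already exists:

```python
def choose_replica(replicas, tickets):
    """Pick the replica whose ChunkServer currently has the lowest Ticket.

    replicas: list of ChunkServer ids holding a copy of the Tablet.
    tickets:  dict mapping ChunkServer id -> current Ticket (access count).
    """
    best = min(replicas, key=lambda cs: tickets.get(cs, 0))
    tickets[best] = tickets.get(best, 0) + 1  # count this access in real time
    return best
```

Because the Ticket is incremented as soon as a replica is chosen, concurrent requests naturally spread across replicas instead of piling onto one ChunkServer.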

A second cause of the traffic imbalance to take into account is the large variance among requests. A ChunkServer exposes two kinds of external interfaces, Get and Scan: Scan scans all data in a row range, while Get fetches specified rows. The counts of the two request types therefore need to be given different weights (α, β) when computing the final Ticket:

[Formula: Ticket computed from the Get and Scan request counts weighted by α and β]

In addition, simply distinguishing the two access modes is not enough: different Scans consume very different amounts of resources, so introducing another important factor, the average response time (avg_time), is also necessary:

[Formula: Ticket with the Scan term further weighted by the average response time avg_time]

The load-balancing algorithm needs strong adaptivity and timeliness: on one hand, new accesses must be accumulated in real time and participate in the next round of scheduling; on the other hand, the historical statistics must decay nonlinearly over time (with decay factor γ) to reduce their influence on real-time decisions:

[Formula: periodic nonlinear decay of the historical Ticket statistics by decay factor γ]
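Putting the three ideas together, the Ticket bookkeeping can be sketched in Python. The original formula images are lost, so the exact combination below (weighting Get and Scan counts by α and β, scaling the Scan term by its average response time, and decaying all history by γ each period) is an assumption reconstructed from the surrounding text, and the parameter values are purely illustrative:

```python
class TicketCounter:
    """Per-ChunkServer load score: weighted, cost-aware, and decaying."""

    def __init__(self, alpha=1.0, beta=4.0, gamma=0.5):
        # alpha, beta: relative weights of Get and Scan requests;
        # gamma: per-period decay factor for historical load.
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.gets = 0.0
        self.scans = 0.0
        self.scan_time_total = 0.0  # used to derive avg_time of Scans

    def record_get(self):
        self.gets += 1

    def record_scan(self, response_time):
        self.scans += 1
        self.scan_time_total += response_time

    def ticket(self):
        avg_time = self.scan_time_total / self.scans if self.scans else 0.0
        # Scans count both by their weight and by how expensive they are
        # on average; cheap Gets contribute only their weighted count.
        return self.alpha * self.gets + self.beta * self.scans * avg_time

    def decay(self):
        """Run once per period: old load fades so recent traffic dominates."""
        self.gets *= self.gamma
        self.scans *= self.gamma
        self.scan_time_total *= self.gamma
```

Each scheduling round then simply picks the replica whose counter reports the lowest `ticket()`, so a ChunkServer that has been serving expensive Scans is avoided until its decayed history catches up.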

With the new algorithm, the load imbalance was greatly eased, the load the cluster could sustain doubled, and overall throughput rose to 80k QPS.

Summary

Load imbalance is a common problem. Left unsolved, it produces the "short-board effect", and can even trigger a chain reaction, the "avalanche" of distributed systems, turning into a system-wide disaster. Load-balancing algorithms are endless: some optimize for cost, some minimize latency, others maximize system throughput. Different goals naturally lead to different algorithms, and there is no cure-all remedy; a more complex algorithm is not a more efficient one [4], and the data-collection overhead an algorithm requires must also be weighed. For the most part, follow the "simple and practical" rule, and analyze and experiment based on the business scenario.

Such flexible strategies place new demands on system design: there should be mechanisms to monitor and verify problems, for example the ability to obtain various internal states and data from the running system in real time, and to select different load-balancing algorithms for testing.


Origin blog.51cto.com/14422312/2415679