Performance Testing from Scratch, Implementation Guide: Capacity Assessment

About this time last year I wrote a blog post, "On capacity testing and capacity planning", which set out some of my personal thoughts on capacity testing and capacity planning.

This year our company is taking part in the Double Eleven promotion, so capacity testing and capacity planning, an important part of full-link stress testing, are on the to-do list.

For capacity testing to provide a useful reference for online capacity planning, capacity assessment is the work we need to do in the preparation phase. How should it be done?

In this post I briefly describe how to carry out a capacity assessment during the preparation phase, along with some of the problems encountered and their solutions.

 

Capacity assessment in nine steps - flow chart

 

First, divide the traffic sources

In the capacity assessment stage, the first thing to do is divide the traffic sources, according to the specific characteristics of the business. They generally fall into the following three sources:

1, PC: taking e-commerce platforms as an example (Taobao, JD.com, Pinduoduo ......), this refers to request traffic initiated by users on PCs;

2, Mobile: mobile here includes phones, tablets and other mobile devices (mobile currently accounts for the largest share of traffic);

3, Mini programs: with the rise of mini programs in recent years, traffic from mini programs and H5 pages has become a channel that cannot be ignored;

Key point: to divide traffic more precisely, you can also split traffic sources by region (domestic / overseas, free-shipping areas / remote areas); the purpose is to allocate data center capacity and configure DNS and networking by region!

Question: how can traffic in different regions be monitored? Use a dedicated solution (Jiankongbao), or run log analysis on the request source addresses and generate a monitoring heat map (a Grafana dashboard);
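As a rough illustration of the log-analysis approach, the sketch below counts requests per region from access-log lines. The log layout (client IP as the first field), the REGION_BY_PREFIX table, and the function names are all assumptions made up for this example; a real setup would resolve regions with a GeoIP database and feed the counts into a dashboard such as Grafana.

```python
from collections import Counter

# Hypothetical mapping from IP prefix to region (illustration only).
REGION_BY_PREFIX = {
    "10.1.": "domestic",
    "10.2.": "overseas",
}


def region_of(ip: str) -> str:
    """Return the region for an IP, or "unknown" if no prefix matches."""
    for prefix, region in REGION_BY_PREFIX.items():
        if ip.startswith(prefix):
            return region
    return "unknown"


def requests_per_region(log_lines):
    """Count requests per region, assuming the client IP is the first field."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        counts[region_of(fields[0])] += 1
    return counts


# Invented sample lines, only to show the input shape.
sample_log = [
    "10.1.0.5 GET /item/1",
    "10.1.0.9 GET /cart",
    "10.2.3.4 GET /item/1",
    "192.168.0.1 GET /health",
]
region_counts = requests_per_region(sample_log)
```

The resulting per-region counts are the kind of data a heat map would be drawn from.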

 

Second, confirm the statistic types

The statistic types here are divided from the perspective of the system architecture: depending on the architecture, confirm which technical components the traffic lands on and in what proportion. There are four types:

1, DB capacity: for example, for a MySQL cluster, the peak-period QPS of each business database (data collection needs to take the scenario into account, as well as whether sharding and primary/replica separation are configured);

2, Service capacity: if the service is a monolith, there is no need to split by business; if it is a microservice or SOA architecture, capacity statistics need to be collected per service according to the business split (dependencies between services also need to be taken into account);

Key point: when assessing service capacity (with QPS as the indicator), you also need to record the single-instance configuration of the service and the current number of machines in production!

3, Message capacity: messages here mainly refers to message queues, such as MQ and Kafka (these also need to be divided by business attribute).

Key point: for message capacity, the main values to collect are: cluster type, Topic, ConsumeGroup, total message volume, daily volume, whether messages accumulate, and peak QPS!

4, Cache capacity: cache here refers to Redis (I have not yet worked with CDNs, so they are not covered here); likewise, it needs to be divided vertically by business.

Key point: when assessing capacity, take into account the Redis deployment mode (Sentinel / Cluster), peak QPS, storage capacity, number of machines, and availability zones (disaster recovery)!

Problem: this involves hot keys and big keys. It is recommended to deal with big keys in advance and to spread hot keys by hashing (remember to check the session persistence policy)!
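One common way to read "spread hot keys by hashing" is to keep several copies of a hot key under different suffixes, so that reads land on different Redis nodes. The sketch below only computes the key names and does not talk to Redis; the ":<slot>" suffix scheme, the function names, and the shard count are assumptions for illustration, not something this post prescribes.

```python
import zlib


def sharded_key(hot_key: str, request_id: str, shards: int = 8) -> str:
    """Pick one of `shards` copies of a hot key, based on a hash of the request id."""
    if shards <= 0:
        raise ValueError("shards must be positive")
    # crc32 is deterministic, so the same request id always maps to the same copy.
    slot = zlib.crc32(request_id.encode("utf-8")) % shards
    return f"{hot_key}:{slot}"


def all_shard_keys(hot_key: str, shards: int = 8):
    """List every copy of the hot key, e.g. for writing or invalidating all of them."""
    return [f"{hot_key}:{slot}" for slot in range(shards)]


# Hypothetical key and request id.
key_for_request = sharded_key("sku:1001:stock", "user-42")
```

The trade-off is that every write has to update all copies, so this suits keys that are read far more often than they are written.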

 

Third, integrate the monitoring components

1, CAT

①, Introduction: CAT is a real-time monitoring platform developed in Java, covering mobile monitoring, application-side monitoring, core network layer monitoring, and system-level monitoring. It provides real-time monitoring and alerting, plus tools for application performance analysis and diagnosis.

②, Features: see the article "An analysis of CAT, Dianping's open source monitoring system"

2, Jaeger

①, Introduction: open source, end-to-end distributed tracing.

②, Chart

3, Sentinel

①, Introduction: a lightweight, highly available flow-control component for distributed service architectures, open sourced by Alibaba's middleware team. It takes traffic as its entry point and protects service stability along several dimensions: flow control, circuit breaking and degradation, and system load protection.

②, Chart

③, Key features

Diversified flow control;

Circuit breaking and degradation;

System protection (load, RT, thread count, inbound QPS, CPU usage);

Real-time monitoring and console configuration;

4, Prometheus

①, Introduction: an open source system monitoring and alerting framework, inspired by Google's Borgmon monitoring system. In 2012, former Google employees working at SoundCloud created Prometheus, and it has been developed as a community open source project.

The project was officially released in 2015. In 2016, Prometheus joined the Cloud Native Computing Foundation (CNCF) and became a project second only to Kubernetes in popularity.

②, characteristics

Multidimensional data model (time series identified by key/value pairs);

A flexible query and aggregation language, PromQL;

Provides local storage and distributed storage;

Time series are collected over HTTP using a pull model;

Push mode can be implemented through the Pushgateway (an optional intermediary component of Prometheus);

Target machines can be found through dynamic service discovery or static configuration;

Supports a variety of charts and dashboards;

 

Fourth, choose the collection scenarios

Choosing the scenarios for data collection depends heavily on mapping out the core links. The following three approaches are recommended.

1, Daily peak

Collect the daily peak traffic in the production environment. The peak here refers to a peak interval, and the interval can generally be 30min;

2, Core links

For mapping out the core links, see my previous post: Performance Testing from Scratch, Implementation Guide: Scenario Model. The diagram is as follows:

3, Full push

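E-commerce businesses often run message or campaign pushes; it is recommended to use the peak traffic during a campaign push as the traffic reference for the data collection scenario;

Key point: after a full push, a short burst of peak traffic floods in, which has some impact on the whole system!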
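To make the "peak interval" of the daily-peak approach concrete, here is a minimal sketch that finds the busiest 30-minute window in a day of per-minute request counts. The input format (one integer per minute) and the function name are assumptions for this example; in practice the counts would come from your monitoring system.

```python
def peak_window(per_minute_counts, window_minutes=30):
    """Return (start_minute, total_requests, average_qps) of the busiest window."""
    n = len(per_minute_counts)
    if window_minutes <= 0 or n < window_minutes:
        raise ValueError("need at least one full window of data")

    # Sliding-window sum: add the minute entering the window, drop the one leaving.
    current = sum(per_minute_counts[:window_minutes])
    best_total, best_start = current, 0
    for end in range(window_minutes, n):
        current += per_minute_counts[end] - per_minute_counts[end - window_minutes]
        if current > best_total:
            best_total = current
            best_start = end - window_minutes + 1

    average_qps = best_total / (window_minutes * 60)
    return best_start, best_total, average_qps


# Invented sample: 30 quiet minutes, 30 busy minutes, 30 quiet minutes.
sample_counts = [60] * 30 + [120] * 30 + [60] * 30
peak_start, peak_total, peak_qps = peak_window(sample_counts)
```

Note that this yields the average QPS over the peak window; the instantaneous peak within the window can be higher.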

 

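Fifth, summarize the traffic data

The traffic statistics table templates are as follows, for reference only:

1, Service capacity

2, Message capacity

3, Cache capacity

4, DB capacity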
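The four tables could be represented roughly as the record types below. This is only a sketch: the field names are my own rendering of the items listed in the second step, and the sample values are invented to show the shape of a row.

```python
from dataclasses import dataclass


@dataclass
class ServiceRow:
    service: str
    peak_qps: int
    instance_spec: str       # single-instance configuration, e.g. "4C8G"
    machine_count: int       # current machines in production

    def qps_per_machine(self) -> float:
        """Peak QPS currently carried by each machine."""
        return self.peak_qps / self.machine_count


@dataclass
class MessageRow:
    cluster_type: str
    topic: str
    consume_group: str
    daily_messages: int
    has_backlog: bool        # whether messages accumulate
    peak_qps: int


@dataclass
class CacheRow:
    business: str
    mode: str                # "sentinel" or "cluster"
    peak_qps: int
    storage_gb: float
    machine_count: int


@dataclass
class DbRow:
    database: str
    peak_qps: int
    sharded: bool            # whether sharding is configured
    read_write_split: bool   # whether primary/replica separation is configured


# Invented sample row.
order_service = ServiceRow("order", 1200, "4C8G", 10)
```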

 

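Sixth, obtain the marketing and traffic-acquisition data

The channels, intensity, and conversion rates of the operations team's marketing and traffic-acquisition campaigns are an important reference; they allow a more accurate estimate of the expected traffic during the big promotion. Consider the following three points: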

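1, Time period

Generally speaking, an e-commerce big promotion runs from the start of the month until the event day, building up momentum along the way; traffic peaks on the event day, followed by a 2-3 day encore, so the whole campaign lasts about half a month.

Find out which activities run in each time period of the campaign; the goal is to identify when peak traffic will hit, and to focus monitoring on those periods;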

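2, Type

That is, which operational activities take place in the periods above, for example: flash sales (overselling scenarios), rush purchases (hot key problems), check-ins, lucky draws, sharing, and so on;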

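3, Magnitude

Magnitude mainly covers indicators such as full pushes, pushes to specific users, push reach rate, and encore conversion rate, which make it easier to estimate the real-time traffic peak;

Question: why collect the marketing and traffic-acquisition data? To estimate peak traffic more precisely, and to prepare and rehearse targeted contingency plans!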
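As a back-of-the-envelope illustration of how these indicators can feed a traffic estimate, the sketch below multiplies them together. The simple multiplicative model, the parameter names, and all the sample numbers are my own assumptions, not a formula from this post; treat the result as a rough starting point to be checked against historical campaign data.

```python
def estimate_push_qps(pushed_users: int,
                      reach_rate: float,
                      conversion_rate: float,
                      requests_per_user: float,
                      window_seconds: int) -> float:
    """Rough average QPS caused by a push, under a simple multiplicative model.

    pushed_users      -- users targeted by the push (full or specific group)
    reach_rate        -- fraction of targeted users the push actually reaches
    conversion_rate   -- fraction of reached users who open the app or page
    requests_per_user -- assumed requests each converted user generates
    window_seconds    -- assumed period over which those users arrive
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    active_users = pushed_users * reach_rate * conversion_rate
    return active_users * requests_per_user / window_seconds


# Invented numbers: 1,000,000 users pushed, 50% reached, 10% converted,
# 6 requests each, arriving over 10 minutes.
estimated_qps = estimate_push_qps(1_000_000, 0.5, 0.1, 6, 600)
```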

 

Seventh, determine the acceptance water level

The acceptance water level serves two main purposes:

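1, Monitoring and alerting thresholds

Determine the online monitoring and alerting thresholds used by operations, so that traffic surges trigger targeted automatic scaling;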

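2, Available resource buffer

A service's processing capacity is limited, and to keep it stable and available, servers must not stay under high load continuously; a certain proportion of resources therefore has to be reserved in advance as a buffer.

When the operations alerting threshold is reached or exceeded, automatic scaling or a rate-limiting policy is triggered. The final performance acceptance water level therefore has to take both points above into account.

If traffic can be controlled precisely and operations automation is fairly mature, 50% resource utilization on a single machine can be used as the scaling trigger (Taobao seems to use this value).

Without such fine-grained control, and with less mature operations automation, 40% is recommended as the acceptance water level.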
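The threshold rule can be expressed as a small check. This is only a sketch: the function name and the action labels are made up, and the 40% default simply mirrors the more conservative value suggested above.

```python
def check_water_level(utilization_percent: float,
                      water_level_percent: float = 40.0):
    """Return (action, headroom) for a single machine.

    headroom is how many percentage points remain before the water level;
    zero or negative means the threshold has been reached or exceeded.
    """
    headroom = water_level_percent - utilization_percent
    if headroom <= 0:
        action = "scale_out_or_rate_limit"
    else:
        action = "ok"
    return action, headroom


# A machine at 35% against the 40% water level still has 5 points of headroom.
status = check_water_level(35.0)
```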

 

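Eighth, run the capacity tests

Running capacity tests belongs to the execution phase, but because the single-machine water level measured by capacity testing is the link between capacity assessment and capacity planning, it is mentioned here in passing.

The purpose of capacity testing is to obtain the single-machine capacity (the capacity in what state and at what threshold, in combination with the seventh step above)!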

 

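Ninth, online capacity planning

All the preparation above ultimately serves to give online capacity planning an accurate reference and a basis for implementation. The usual calculation is as follows:

Service A has a single-machine capacity of TPS=200 at the 50% water level, denoted T; the estimated online traffic converts to a TPS of 3000, denoted S; to keep the service highly available, 30% of machine resources are reserved as a scaling buffer, denoted B;

The number of machines that service A needs online is then: Count(A) = (1+B)*(S/T) = (1+30%)*(3000/200) = 19.5 machines; rounding up, capacity planning for service A calls for 20 machines.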
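The same calculation as a small function, as a minimal sketch. The function and parameter names are my own; the buffer is passed as a whole-number percentage so that the rounding-up step uses integer arithmetic and is not affected by floating-point error.

```python
def required_machines(estimated_tps: int,
                      single_machine_tps: int,
                      buffer_percent: int) -> int:
    """Count = ceil((1 + B) * S / T), with B given as a whole-number percentage."""
    if single_machine_tps <= 0:
        raise ValueError("single_machine_tps must be positive")
    numerator = estimated_tps * (100 + buffer_percent)
    denominator = single_machine_tps * 100
    # Integer ceiling division: -(-a // b) rounds a / b up.
    return -(-numerator // denominator)


# Service A: S = 3000, T = 200 (at the 50% water level), B = 30%.
machines_for_a = required_machines(3000, 200, 30)
```

With the numbers from the text this gives 20 machines, matching the result worked out by hand.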

 

Finally, don't forget to run targeted high-availability verification online!

 


Origin www.cnblogs.com/imyalost/p/11623716.html