Introduction to Requirements Analysis: Talking about Architecture (3) Availability Topics

The previous article introduced various indicators of non-functional requirements and some industry standards. One of the non-functional requirements is reliability,
and an indicator closely associated with it is availability. This article describes availability and reliability in more detail.

concept

When browsing the websites of cloud service providers, we often see statements like this in product introductions: "Our service availability is as high as 99.99%."
What does this availability mean?

  • Definition:
    Availability is the proportion of time, within a given period, during which the system runs normally and can be used normally.
    For example, if a system operates normally for 364 days within a year and the accumulated failure time is 1 day, then the availability is 364/365 ≈ 99.7%
  • The difference from reliability:
    Reliability is measured by the average interval between two failures.
  • Usually the two are related, that is, good reliability usually goes with high availability.
    However, there are exceptions, as the following two scenarios show:
    • High availability but poor reliability.
      The system goes down once every minute and returns to normal within 1 second. The availability is relatively good, ≈ 98.3%,
      but failures are frequent. The continuous normal service time is only 59 seconds, so the reliability is poor.
    • Good reliability but lower availability.
      The system goes down once a day for 2 hours each time. The availability is worse than in the scenario above, ≈ 91.7%,
      but the reliability is better: the continuous normal service time is 22 hours.
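
The two scenarios above can be checked with a few lines of Python. This is only a sketch: the function name and the way each scenario is encoded as (continuous uptime, downtime) per failure cycle are assumptions made for illustration.

```python
def availability(uptime_s: float, downtime_s: float) -> float:
    """Fraction of one failure cycle during which the system is usable."""
    return uptime_s / (uptime_s + downtime_s)


# (continuous uptime, downtime) per failure cycle, in seconds
scenarios = {
    "1 s outage every minute": (59, 1),
    "2 h outage every day": (22 * 3600, 2 * 3600),
}

for name, (up, down) in scenarios.items():
    # Availability favors the first scenario; uptime between failures favors the second.
    print(f"{name}: availability={availability(up, down):.1%}, "
          f"continuous uptime={up} s")
```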

measure

There are usually two ways to measure the availability of a software system:

time based

Availability = Uptime / (Uptime + Downtime)
that is, the normal working time divided by the total time

based on request

Availability = Success / Total
that is, the number of successful responses divided by the total number of requests
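
A minimal sketch of the two measurements, assuming the uptime, downtime, and request counts have already been collected somewhere else. The function names are illustrative, not part of any library.

```python
def time_based_availability(uptime: float, downtime: float) -> float:
    """Uptime / (Uptime + Downtime); both arguments use the same time unit."""
    total = uptime + downtime
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime / total


def request_based_availability(success: int, total: int) -> float:
    """Success / Total; `total` counts every request, successful or not."""
    if total <= 0:
        raise ValueError("total request count must be positive")
    return success / total


# 364 days up and 1 day down, as in the definition earlier in the article
print(f"{time_based_availability(364, 1):.2%}")
# 9,999 successful responses out of 10,000 requests (made-up numbers)
print(f"{request_based_availability(9_999, 10_000):.2%}")
```

The time-based form suits services with steady traffic, while the request-based form weights an outage by how many requests it actually affected.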

The two figures below show, on the left, the downtime a system accumulates at each availability level and, on the right, the availability that common cloud services commit to:
[Figure: downtime at each availability level (left) and availability commitments of common cloud services (right)]
From the figures, we can see that:

  • The lower the availability, the greater the impact on users; the more faults there are, the higher the probability of user complaints and customer loss.
  • The availability promised by cloud service vendors is not that high. When designing your system, you must also consider the impact of cloud service vendor failures.
    Reference:
    • Alibaba Cloud SLA service level agreement statement: https://help.aliyun.com/document_detail/56773.htm
      The SMS service promises only 95% availability, while storage promises up to 99.995%
    • Microsoft Cloud Azure: https://azure.microsoft.com/zh-cn/support/legal/sla/

Digression:

  • When a cloud provider fails, as long as the downtime stays within what the promised availability allows, you basically get just an apology; even if it exceeds that, compensation is generally limited to service time proportional to the failure. For example, if the failure lasts for 1 hour, you will be compensated with 3 hours of cloud service usage time, nothing more.
    Therefore, if your data is lost and the incident does not turn into a big public issue, you generally have to figure out a solution by yourself.
    You can search for it: in 2018, a company called Frontier CNC lost data on Tencent Cloud; the claim was tens of millions, but only 130,000 was paid.

  • The SaaS service of my previous company was deployed on Alibaba Cloud. When the service failed, the marketing staff would first tell the user "Alibaba Cloud is down again", and then complain to the boss about the R&D staff.

  • The availability of most web services is three nines (99.9%) or below

How to improve availability

case analysis

Suppose a login request in a system needs to access 4 databases (mysql/mongodb/cassandra/redis), and each DB is known to have 99.9% availability, as shown in the figure. What is the availability of this request?
[Figure: a login request accessing 4 databases in series]
The actual availability of this serial request is 99.9% to the 4th power ≈ 99.6%, which means:
a single DB at 99.9% availability is unavailable for 8.76 hours per year;
but with 4 DBs accessed in series, the request is unavailable for about 35.04 hours per year, so the probability of failure has roughly quadrupled.

Moreover, there are other servers and networks in the figure. Assuming that the network is unstable and the availability of each DB drops to 99.5%, the availability of this serial request drops to 99.5% to the 4th power ≈ 98%.
That is equivalent to about 175.2 hours of unavailability per year, an average of 14.6 hours of downtime per month.
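
The arithmetic in this case can be reproduced with the sketch below. It assumes the four databases fail independently and that the request fails whenever any one of them is down; the function name is illustrative.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def serial_availability(parts):
    """Availability of a request that needs every dependency to be up."""
    return math.prod(parts)


for per_db in (0.999, 0.995):
    a = serial_availability([per_db] * 4)  # 4 databases in series
    downtime = (1 - a) * HOURS_PER_YEAR
    print(f"per-DB {per_db:.1%}: request {a:.2%}, "
          f"about {downtime:.1f} h/year, {downtime / 12:.1f} h/month")
```

The unrounded products give about 35.0 and 173.9 hours per year, slightly below the figures in the text, which are derived from the rounded 99.6% and 98%.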

Ways to Improve Availability

From the case above, we can conclude that the more modules (microservices) and middleware (databases, message queues, etc.) a request passes through in series, the lower its availability.
So how can availability be improved?
The answer is to connect components in parallel, that is, service redundancy, which is usually implemented with what we call load balancing:

  • Availability calculation of a series system (2 nodes): A = p1 * p2
    If the availability of both nodes is 99%, then A = 99% * 99% = 98.01%
  • Availability calculation of a parallel system (2 nodes): A = 1 - (1 - p1) * (1 - p2)
    If the availability of both nodes is 99%, then A = 1 - (1 - 99%) * (1 - 99%) = 99.99%
    With one redundant node, availability rises from 99% to 99.99%, and annual downtime drops from 87.6 hours to about 52 minutes.
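
The parallel formula generalizes to any number of replicas. The sketch below assumes replicas fail independently, which real deployments only approximate (shared racks, networks, or bugs correlate failures); `replicas_needed` is a hypothetical helper, not something from the article.

```python
import math


def parallel_availability(parts):
    """A parallel system is down only when every replica is down at once."""
    return 1 - math.prod(1 - p for p in parts)


def replicas_needed(node_availability: float, target: float) -> int:
    """Smallest number of identical replicas whose combined availability reaches `target`."""
    if not 0 < node_availability < 1 or not 0 < target < 1:
        raise ValueError("availabilities must be strictly between 0 and 1")
    n = 1
    while parallel_availability([node_availability] * n) < target:
        n += 1
    return n


print(f"{parallel_availability([0.99, 0.99]):.2%}")  # two 99% nodes in parallel
print(replicas_needed(0.99, 0.99999))  # replicas of a 99% node needed for five nines
```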

How to realize system parallel connection:

  • Each service is deployed on multiple independent servers;
  • A gateway at the front receives each request and forwards it to a healthy node, according to the status of each node of the service;
  • The gateway itself also needs to be redundant. Usually an election mechanism is used to keep the gateway layer healthy. Common election mechanisms include the Paxos algorithm and the Raft algorithm, which you can look up yourself.
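
The gateway behavior described above can be sketched as a round-robin selector that skips unhealthy nodes. This is a toy model under stated assumptions: the `Gateway` class and node names are hypothetical, health status is set by hand instead of by real health checks, and gateway redundancy and leader election are left out.

```python
from itertools import count


class Gateway:
    """Round-robin over the nodes currently marked healthy (illustrative only)."""

    def __init__(self, nodes):
        self.health = {node: True for node in nodes}
        self._counter = count()

    def mark(self, node, healthy: bool) -> None:
        """Record the result of a health check for `node`."""
        self.health[node] = healthy

    def pick(self):
        """Return the next healthy node, or fail if none is left."""
        healthy = [n for n, ok in self.health.items() if ok]
        if not healthy:
            raise RuntimeError("no healthy node available")
        return healthy[next(self._counter) % len(healthy)]


gw = Gateway(["node-a", "node-b", "node-c"])
gw.mark("node-b", False)  # pretend a health check failed
print([gw.pick() for _ in range(4)])  # requests are spread over node-a and node-c only
```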

Difficulties:

  • The architecture design is complex and involves reworking old systems. Some systems do not support parallel deployment as they are, for example those that use an in-memory cache or in-process Session state;
  • Costs roughly double, so you need to weigh the product's market against development costs, deployment costs, and subsequent operation and maintenance costs;
  • When using containers, such as K8S, make sure that multiple instances (Pods) of the same service are deployed on different worker nodes;
    I once encountered two Pods of a service running on the same node; when that node went down, the entire service became unavailable.
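
In Kubernetes itself, spreading replicas is normally configured with pod anti-affinity or topology spread constraints. As a complementary check, the sketch below flags services whose replicas all sit on one node. The input shape, a list of (service, node) pairs, is an assumption for illustration and not a Kubernetes API structure.

```python
from collections import defaultdict


def single_node_services(pods):
    """Return the services whose replicas all run on a single node.

    `pods` is an iterable of (service, node) pairs (hypothetical input format).
    """
    nodes_by_service = defaultdict(set)
    for service, node in pods:
        nodes_by_service[service].add(node)
    return sorted(s for s, nodes in nodes_by_service.items() if len(nodes) == 1)


placement = [
    ("login", "node-1"), ("login", "node-1"),  # both replicas on the same node
    ("order", "node-1"), ("order", "node-2"),  # replicas spread over two nodes
]
print(single_node_services(placement))
```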

Identify optimization goals

To improve the availability of a system, do not blindly add parallel processing to every service and every piece of middleware. Instead, follow these steps:

  • 1. Determine appropriate availability goals
    • Research user expectations
    • Determine the company's business goals (at least 99.9% for ToB products)
    • Refer to the scale and service level of competing products
  • 2. Measure our system
    The usual approach: decide on the indicators, add instrumentation points and collect data (for example with Prometheus), then compute statistics such as:
    avg, max, min, dev (mean absolute deviation, (∑|x − x̄|) ÷ n), long tail (95th and 99th percentiles)
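
The statistics listed in step 2 can be computed as in the sketch below. It assumes the samples (for example response times) are already available as a list of numbers, and it uses the nearest-rank definition of a percentile, which is one of several common definitions.

```python
import math
import statistics


def summarize(samples):
    """Return avg, max, min, mean absolute deviation, and tail percentiles."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("at least one sample is required")
    mean = statistics.fmean(xs)

    def percentile(p: int):
        # Nearest-rank: smallest sample with at least p% of the samples at or below it.
        rank = max(1, math.ceil(p * n / 100))
        return xs[rank - 1]

    return {
        "avg": mean,
        "max": xs[-1],
        "min": xs[0],
        "dev": sum(abs(x - mean) for x in xs) / n,  # (∑|x − x̄|) ÷ n
        "p95": percentile(95),
        "p99": percentile(99),
    }


# Made-up response times in milliseconds, with one slow outlier
print(summarize([12, 15, 11, 14, 13, 250, 12, 16, 13, 14]))
```

The tail percentiles matter because a single slow or failed request, like the 250 ms outlier here, barely shows up in the middle of the distribution but is exactly what users notice.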

availability epilogue

The figure below shows the three stages of a product's operation:
[Figure: the three stages of a product's operation]
With this figure, I want to say:
No matter how impressive X's architecture and middleware are, they evolved and matured step by step. At the beginning, everyone starts out simple...

Continuous refactoring...

In the next article, I will introduce performance-related concepts and some methods for finding performance problems.

Origin blog.csdn.net/youbl/article/details/131325099