Demystified | Why releases stay "silky smooth" under heavy traffic

Introduction: Many Internet companies release in the middle of the night simply to reduce the impact on users and keep any problems contained. MSE service governance provides lossless offline (taking instances out of service without dropping traffic), which protects traffic during a release and frees you from the awkwardness of midnight releases.

Why do so many Internet companies avoid releasing during the day and release in the middle of the night instead? Wouldn't it be great to be rid of midnight releases? Releasing at midnight is simply a way to reduce the impact on users and keep the scope of any problem under control.

So let's look at the problems that can occur during a release.

  • If your application has not solved the online and offline (startup and shutdown) problem, any release will cause brief service unavailability, and a burst of IO exceptions will appear in business monitoring within a short time, disrupting business continuity.
  • Release is the last step in bringing a feature update online. Some problems accumulated during development are only triggered at this final step. Under heavy daytime traffic, even small problems are quickly amplified, and the impact is hard to control.
  • If the release involves multiple applications, how do you release them in a sensible order so that version mismatches do not cause traffic loss?

Release problems broadly fall into these three categories. Over several articles, I will discuss in detail why these problems exist and how we solved them. I also hope everyone can leave work on time, be rid of midnight releases, and spend more time with family.

This article focuses on the online and offline scenarios and uses examples to describe the problems that occur during a release.

Current state of application releases under heavy traffic

Application Demo

Taking Spring Cloud as an example, we prepared the following demo. Traffic is generated by Alibaba Cloud's Performance Testing Service (PTS) and enters the system through the open source Zuul gateway.

PTS documentation: https://pts.console.aliyun.com/

The service call link is shown in the figure below:

image.png

In the figure, traffic enters through the Ingress backed by Netflix Zuul and calls the service of the SC-A application. SC-A in turn calls the SC-B service, and SC-B calls the SC-C service.

Deploying the demo with Helm

helm install mse/mse-samples

The demo is a pure open source Spring Cloud architecture. Project address:

https://github.com/aliyun/alibabacloud-microservice-demo/tree/master/microservice-doc-demo/traffic-management

After deployment, the workloads on Alibaba Cloud Container Service look like this:

image.png

We use the shell command while true; do curl http://{ip:port}/A/a;echo;done to access the Spring Cloud service continuously. The demo simply prints the IP of each service instance that handles the request, so the output shows the whole call chain.

while true; do curl http://{ip:port}/A/a;echo;done
A[10.0.0.81] -> B[10.0.0.82] -> C[10.0.0.68]
A[10.0.0.179] -> B[10.0.0.82] -> C[10.0.0.50]
A[10.0.0.80] -> B[10.0.0.82] -> C[10.0.0.68]
A[10.0.0.49] -> B[10.0.0.82] -> C[10.0.0.50]
A[10.0.0.81] -> B[10.0.0.175] -> C[10.0.0.68]
A[10.0.0.179] -> B[10.0.0.175] -> C[10.0.0.50]
A[10.0.0.80] -> B[10.0.0.175] -> C[10.0.0.68]
A[10.0.0.49] -> B[10.0.0.175] -> C[10.0.0.50]
A[10.0.0.81] -> B[10.0.0.82] -> C[10.0.0.68]
...
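
To put a number on failures instead of watching the output scroll by, the same kind of curl loop can record the HTTP status of each request and tally the ones that are not 2xx. This is only a minimal sketch: the helper names probe_once and count_failures and the 2-second timeout are our own assumptions, not part of the demo, and {ip:port} is the same placeholder as above.

```shell
#!/usr/bin/env bash
# Minimal sketch. Helper names and the 2-second timeout are assumptions,
# not part of the demo project.

# Print only the HTTP status code of one request.
# curl writes 000 when no HTTP response was received (for example, connection refused).
probe_once() {
  curl -s -o /dev/null -m 2 -w '%{http_code}' "$1" || true
}

# Read one status code per line from stdin and print "<total> <failed>",
# counting every code that does not start with 2 as a failure.
count_failures() {
  awk '{ total++; if ($1 !~ /^2/) failed++ } END { printf "%d %d\n", total, failed }'
}

# Example: send 1000 requests, then print how many were sent and how many failed.
# for i in $(seq 1 1000); do probe_once "http://{ip:port}/A/a"; echo; done | count_failures
```

Running the commented loop while scaling or releasing sc-a gives a rough failure count from the client side, independent of the PTS report.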

Configure the stress test at 500 QPS, then scale the application in, scale it out, and release it while the test is running, and observe the results.

How the open source application behaves under heavy traffic

Scale-in

Under a 500 QPS stress test, the sc-a application is scaled in from 4 pods to 1 pod. The test runs for 3 minutes.

  1. Looking at the K8s events, we see that the application was scaled in at 17:35:21.
    image.png
  2. In the stress test report, errors start at 17:35:21 and stop at 17:35:36. They last 15 seconds, with 469 exceptions in total.
    image.png
  3. The detailed report is as follows.
    image.png

Scale-out

Now let's look at how the application behaves when it is scaled out under load. Under a 500 QPS stress test, the sc-a application is scaled out from 1 pod to 4 pods. The test runs for 3 minutes.

  1. Looking at the K8s events, we see that the application was scaled out at 17:47:03.
    image.png
  2. In the stress test report, errors start at 17:47:12 and stop at 17:47:19. They last 7 seconds, with 257 exceptions in total.
    image.png
  3. The detailed report is as follows.
    image.png

Release

Under a 500 QPS stress test, the sc-a application (4 pods) is released. The test runs for 3 minutes.

  1. Looking at the K8s events, we see that the application was released at 17:53:42.
    image.png
  2. In the stress test report, errors start at 17:53:42 and stop at 17:54:24. They last 42 seconds, with more than 10,000 exceptions.
    image.png
  3. The detailed report is as follows.
    image.png
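
As a rough cross-check of the three reports above, the error counts can be related to the load. The sketch below assumes that PTS held a steady 500 QPS for the whole error window, which the reports do not confirm, so treat the percentages as estimates. The helper name error_rate is our own, and for the release 10,000 is used as a lower bound because the report only says "more than 10,000".

```shell
#!/usr/bin/env bash
# Estimate the share of failed requests inside an error window.
# Assumption: the load stayed at the configured QPS for the whole window.

# error_rate ERRORS QPS SECONDS -> percentage of requests that failed
error_rate() {
  awk -v e="$1" -v q="$2" -v s="$3" 'BEGIN { printf "%.2f\n", 100 * e / (q * s) }'
}

error_rate 469 500 15     # scale-in:  prints 6.25
error_rate 257 500 7      # scale-out: prints 7.34
error_rate 10000 500 42   # release:   prints 47.62 (lower bound)
```

Under that assumption, the release is by far the worst case: roughly half of the requests sent during its 42-second error window failed.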

Current state and reflections

Clearly, application release under heavy traffic is a pressing problem. As cloud-native architecture has developed, capabilities such as elastic scaling, rolling upgrades, and batch releases let users find the best balance of resources, cost, and stability. Precisely because of this elasticity, any problems an application has when going online or offline are magnified under a cloud-native architecture.

Imagine unnecessary errors on every scale-out, scale-in, and release: business continuity and the product experience would take a heavy blow. Keeping the business unaware of a service update is a problem developers must solve. That is, from the moment the application stops until it is running again, normal business requests must not be affected.
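
One common way to approach this goal is to take the instance out of service discovery before the process stops, so that callers stop routing to it while in-flight requests finish. The script below is only a sketch of that ordering, in the style of a Kubernetes preStop hook. It does not describe how MSE implements lossless offline, and every function body as well as the DRAIN_WAIT_SECONDS variable is a hypothetical placeholder.

```shell
#!/usr/bin/env bash
# Sketch of a drain sequence. All function bodies are hypothetical placeholders;
# replace them with whatever your registry and application actually provide.

# Step 1: remove this instance from the service registry so new calls stop arriving.
deregister_instance() {
  echo "deregistered"
}

# Step 2: give consumers time to refresh their instance lists and let
# in-flight requests finish. DRAIN_WAIT_SECONDS is an assumed tuning knob.
wait_for_drain() {
  sleep "${DRAIN_WAIT_SECONDS:-30}"
  echo "drained"
}

# Step 3: only now stop the application process.
stop_application() {
  echo "stopped"
}

# Run the steps in order and abort if any step fails.
graceful_offline() {
  deregister_instance && wait_for_drain && stop_application
}
```

If a script like this is called from a preStop hook, the pod's terminationGracePeriodSeconds needs to be longer than the wait, otherwise Kubernetes kills the container before the drain finishes.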

Reducing unnecessary API errors is the best customer experience.

This is a real pain point. Now suppose someone tells you: "I know how to fix this. I have plenty of experience with it." You would be delighted.

So you hire them at a high salary. And they really are good: the architecture diagrams, framework principles, and framework modifications are all explained clearly, and the functionality looks complete. Then comes the cost estimate for changing the current system: three sets of middleware servers to build, 4 new middleware dependencies, and tens of thousands of lines of code and configuration to modify.

"Sorry, the business comes first. The product manager's requirements are still not finished. The scenario we just discussed is not that painful after all. It is only a few minor issues. Really, it's fine."

At this point, MSE tells you that its microservice solution solves the online and offline problems without any code or configuration changes. You only need to connect your application to MSE service governance to get MSE's lossless offline capability.

Tempted?

Yes, you read that right. As long as your application is built on a Spring Cloud or Dubbo version released within the last five years, you can use the full MSE microservice governance capabilities directly, without modifying any code or configuration.

Releasing applications with lossless offline

How to enable MSE lossless offline

You only need to connect your application to MSE service governance to get the lossless offline capability of microservice governance.

Behavior after connecting to MSE

Let's look at scale-in, scale-out, and release behavior after connecting to MSE service governance, using the same demo as before.

Scale-in

Under a 500 QPS stress test, the sc-a application is scaled in from 4 pods to 1 pod. The test runs for 3 minutes.

  1. Looking at the K8s events, we see that the application was scaled in at 17:41:06.
    image.png
  2. In the stress test report, traffic is lossless throughout, and concurrency stays stable at around 30.
    image.png
  3. The detailed report is as follows. The scale-in is completely imperceptible to the business.
    image.png

Scale-out

Under a 500 QPS stress test, the sc-a application is scaled out from 1 pod to 4 pods. The test runs for 3 minutes.

  1. Looking at the K8s events, we see that the application was scaled out at 20:00:19.
    image.png
  2. In the stress test report, no errors are reported.
    image.png
  3. The detailed report is as follows. The scale-out causes no errors for the business, but there is a concurrency bump at 20:01:07. The lossless online feature, to be launched later, will improve this logic and smooth out the bump.
    image.png

Release

Under a 500 QPS stress test, the sc-a application (4 pods) is released. The test runs for 3 minutes.

  1. Looking at the K8s events, we see that the application was released at 20:08:55.
    image.png
  2. In the stress test report, no errors are reported at any point.
    image.png
  3. The detailed report is as follows. The release causes no errors for the business, but there are slight concurrency bumps at 20:09:39 and 20:10:27. The lossless online feature, to be launched later, will improve this logic and smooth out the bumps.
    image.png

Comparing how the application behaves during release before and after connecting to MSE service governance, we can see that in these tests MSE eliminated the request errors that occurred during release and scaling, making the business more stable and the product experience smoother. And once connected to MSE service governance, you get lossless offline without modifying a single line of code.

Summary

This article introduced the lossless offline capability of microservice governance, which protects traffic during a release and frees you from midnight releases. Your application only needs to connect to MSE service governance to get lossless offline, with no further work required. Besides MSE (Microservice Engine), the lossless offline capability is also integrated into cloud products such as EDAS and SAE. Lossless offline is also used at scale in Alibaba Cloud's core businesses, helping to keep cloud workloads stable and your business always online.

The following articles will explain in detail the "dark magic" that lets an application connected to MSE service governance release silky smooth even under heavy daytime traffic. Three to four articles are planned on this topic, so stay tuned!

Not just service governance

MSE (Microservice Engine) offers more than microservice governance: it also provides managed open source registries, configuration centers, and gateways. Through these managed BaaS products, we deliver more than ten years of Alibaba Cloud microservice best practices as cloud products, helping to keep cloud workloads stable and your business always online.

Microservice Engine user group

If you have any questions while using the Microservice Engine (MSE), you are welcome to search for DingTalk group number 23371469, or scan the following QR code with DingTalk, to join the group and give feedback.

image.png

Original link: https://developer.aliyun.com/article/780231?

Origin blog.csdn.net/alitech2017/article/details/112465964