TOP100summit: [Shared Record-Huawei] Best Practices for Performance Improvement in Microservice Scenarios

The content of this article comes from the case sharing of Wang Qijun, senior architect of Huawei's Architecture Department at the 2016 TOP100summit.
Editor: Cynthia

Wang Qijun: Senior architect in Huawei's Architecture Department, responsible for the rollout of Huawei's cloudification and microservice architecture, and a participant in the architecture design of Huawei Mobile Cloud 4.0 and IoT 2.0. He was previously an architect at Dangdang.com, where he led the architecture design of the e-commerce platform, covering orders, payment, pricing, inventory, logistics, and more. Before that he worked at Sohu on mobile Weibo R&D. He writes the "Running Snail" public account.

Introduction: With the advent of the cloud era, software architecture is changing rapidly and new technologies keep emerging. Microservices in particular have been widely embraced by the industry. However, microservices are not a silver bullet, and performance under a microservice architecture is especially important for companies delivering projects.
This article focuses on performance problems under microservices. Through an actual problem-solving process, it analyzes how to evaluate and improve overall performance once services have been split to a certain granularity and call chains have grown longer.

1. Questions

In the architecture design phase, the basic elements everyone cares about include performance, availability, consistency, scalability, and security. Among these, performance is often overlooked early on, but as the architecture evolves and the system grows, it becomes more and more important: sometimes a small change can save half of the server resources.

What exactly is performance?
● One aspect is response time: the time from sending a request to receiving the result;
● the other is throughput: the number of responses per unit of time.

Of course, these two indicators are only meaningful under given resource constraints: how much disk, how much CPU, how much memory, and so on. Throughput and response time also influence each other, although the relationship is not absolute.
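One rough way to sketch how the two interact is Little's Law: in steady state, concurrency equals throughput times response time, so throughput is capped by concurrency divided by response time. The worker count and latency below are hypothetical numbers, not figures from the talk:

```python
def max_throughput(concurrency, avg_response_s):
    """Little's Law: concurrency = throughput * response time in steady state,
    so throughput cannot exceed concurrency / response time."""
    return concurrency / avg_response_s

# Hypothetical numbers: 100 workers, 50 ms average response time
print(max_throughput(100, 0.05))  # 2000.0 requests/s ceiling
```

This is why cutting response time in half (or doubling concurrency) roughly doubles the throughput ceiling, until some other resource becomes the bottleneck.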

Common performance issues include:

● Memory leaks, leading to memory exhaustion
● Overload: burst traffic and large numbers of timeout retries
● Network bottlenecks: too much content to transfer
● Blocking: endless waiting
● Locks: contention that serializes execution and throttles throughput
● Busy I/O: large volumes of reads and writes
● Busy CPU: common in compute-heavy work
● Long-running requests piling up: connection exhaustion

When throughput becomes the problem, we expect to scale out via threads or processes. The thread approach is considerably more complicated than the process approach, because threads must deal with changes to shared data.

Microservices are a pattern of process-level scaling. We can try to define microservices as follows:

An architectural style in which:

● a single service focuses on one thing as much as possible, with high cohesion and low coupling;
● services are isolated in separate processes;
● each service can be independently developed, tested, built, and deployed;
● services are small and flexible.

The advantages of the microservices architecture include:

● Shorter lead time. Each service can be developed, tested, and delivered independently, shortening the cycle;
● Faster communication. Small teams reduce the communication costs caused by code coupling; because the business is split by service, newcomers can get started quickly without first understanding the whole system;
● Customization. Services can be flexibly combined for new business scenarios as market demands change;
● Isolation. Process isolation effectively limits the blast radius of faults;
● Technology stack. Each service can choose the technology stack that suits its needs;
● Evolutionary optimization. The system can evolve and be optimized at service granularity.

The problems that come with it include:

● Architectural complexity. The explosion in the number of services creates complex architectural issues, such as consistency problems and the overhead of large numbers of remote calls;
● Management cost. More services mean higher management and operations cost. We want short delivery cycles, and short cycles inevitably mean faster change; change is the natural enemy of availability, so these problems must be addressed with automation and visualization;
● Fault location. When a single request traverses dozens of services, how do we quickly locate a fault?
● Performance loss. What used to be one call now flows through several or dozens of services. How do we improve response time, and how do we compensate for the throughput lost to splitting? Cloudification means horizontal scaling: how should the architecture scale to improve overall system throughput?

2. The practical process

The first step is to set goals.

Usually the requirements you get sound like "make it faster!" or "performance no worse than before". Obviously, these are not valid targets.

How should goals be set? Let's start by listing a few common but ineffective ones.

● "Average response time of 1s"
As this data set shows, [2, 5, 3, 4, 301, 4, 2, 8, 7, 3, 3, 1, 1, 8, 2] has AVG = 23.6: the average is skewed badly by a single severe timeout, so it cannot describe response time correctly.
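A quick sketch of why the average misleads here, using the data set above and a simplified nearest-rank percentile (no interpolation):

```python
import math

data = [2, 5, 3, 4, 301, 4, 2, 8, 7, 3, 3, 1, 1, 8, 2]

avg = sum(data) / len(data)  # 23.6 — dragged up by the single 301 ms timeout

def percentile(values, p):
    """Nearest-rank percentile: a simple sketch, not an interpolated formula."""
    s = sorted(values)
    k = math.ceil(p / 100 * len(s)) - 1
    return s[max(k, 0)]

median = percentile(data, 50)  # 3
p90 = percentile(data, 90)     # 8
print(avg, median, p90)        # 23.6 3 8
```

The median (3 ms) and p90 (8 ms) describe what most users actually experience; the average of 23.6 ms describes almost no real request.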

● "99% of requests complete within 1s"
First, different services should not all share the same response-time target;
second, a response-time target is meaningless on its own; it must be set together with a throughput target.

● "Support 1 million concurrent users"?
First, this does not translate into a system throughput target;
second, a growth rate only makes sense relative to how much data the system currently holds.


An error-rate target is also needed: however good the other indicators look, they are useless if requests are returning errors.

We can try to set targets like this, for example for placing an order:

● Response time: 95% of requests < 1s
● Throughput: > 100,000 tps
● System data volume: 1 billion > order table > 100 million rows...
● Growth rate: 100 million/year
● Resource limits: server resources...
● Other influencing factors...

At the same time, other goals should also be set synchronously, such as overall system availability, consistency, etc.

The second step is to find bottlenecks.

Do a stress test first, and run it against the production environment: results measured in a test environment are not meaningful. Likewise, it is meaningless to test a single function outside the context of a realistic scenario.

Second, conduct comprehensive monitoring, including call chain analysis, to quickly locate performance bottlenecks.

The third step is optimization.

There are many optimization methods, including: synchronous to asynchronous, blocking to non-blocking, data redundancy, data splitting, data merging, compression, simplifying business links, etc. The key depends on the application scenario and cost.

Service framework

[Figure 1: serial vs. parallel service calls on a product detail page]

As shown in Figure 1, the detail page in an e-commerce business calls services such as price, inventory, and commodity, and displays the combined information. With serial calls, the total time equals the sum of the time spent in each service; with parallel calls, the total time equals the maximum of the three, a significant performance improvement.
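A minimal sketch of the parallel fan-out using asyncio; the service names and latencies are invented for illustration:

```python
import asyncio
import time

async def call_service(name, latency_s):
    """Simulated remote call; in reality this would be an RPC or HTTP request."""
    await asyncio.sleep(latency_s)
    return f"{name}-data"

async def detail_page():
    # Parallel fan-out: total time ~= max(latencies), not their sum
    price, stock, product = await asyncio.gather(
        call_service("price", 0.05),
        call_service("inventory", 0.08),
        call_service("commodity", 0.03),
    )
    return price, stock, product

start = time.perf_counter()
result = asyncio.run(detail_page())
elapsed = time.perf_counter() - start
print(result, f"{elapsed * 1000:.0f} ms")  # finishes in ~80 ms, not ~160 ms
```

Serially the three calls would take about 50 + 80 + 30 = 160 ms; in parallel the page waits only for the slowest call.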

There are many other optimization points, for example:
● Use an efficient serialization protocol. Protocols such as protobuf and thrift perform much better than HTTP+JSON and can be used for service-internal calls;
● Use long-lived connections to avoid the performance cost of repeatedly establishing connections;
● Isolate business threads from I/O threads, and so on.

Message middleware

[Figure 2: order placement decoupled through MQ]

Message middleware can shave peaks, fill valleys, and improve throughput. As shown in Figure 2, the order-placing operation writes directly to MQ and returns; MQ guarantees eventual consistency, which reduces response time and turns a strong dependency into a weak one. That is, temporary unavailability of the order system no longer affects the order-placing operation. In addition, MQ throughput is much higher than that of a relational database, and MQ is comparatively easy to scale out.

Of course, using MQ has its own problems: there is a consistency time window, which is fatal for businesses that require strong consistency.
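The peak-shaving idea can be sketched with an in-process queue standing in for a real MQ broker (the sleep simulates a slow database write; all names here are illustrative):

```python
import queue
import threading
import time

order_queue = queue.Queue()  # stand-in for a real MQ broker
processed = []

def place_order(order_id):
    """Fast path: enqueue and return immediately; consistency is eventual."""
    order_queue.put(order_id)
    return "accepted"

def order_worker():
    """Slow consumer drains the queue at its own pace (peak shaving)."""
    while True:
        order_id = order_queue.get()
        if order_id is None:  # shutdown sentinel
            break
        time.sleep(0.001)     # simulate the expensive database write
        processed.append(order_id)
        order_queue.task_done()

t = threading.Thread(target=order_worker)
t.start()
for i in range(100):          # a burst of traffic is absorbed by the queue
    place_order(i)
order_queue.join()            # in this demo, wait for eventual consistency
order_queue.put(None)
t.join()
print(len(processed))  # 100
```

The caller's response time is just the enqueue cost, while the consumer works through the backlog; the gap between `place_order` returning and the item landing in `processed` is exactly the consistency window mentioned above.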

Distributed cache

Caching is a powerful tool for improving performance. A local cache cannot be shared, which wastes a great deal of memory, and its garbage collection also interferes with the business service. In a microservice architecture we generally require state to be externalized to caches and databases, and large-scale applications mostly use a distributed cache.

Since scaling the database is complicated and has many after-effects, using a cache to take pressure off the database is a very good practice.

Distributed caches such as Redis and Memcached deliver on the order of 100,000 qps, a huge improvement over the few thousand qps of a database.

Of course, the problem a distributed cache brings is consistency. When should the cache be updated? What if the cache update fails while the database update succeeds?
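One common answer to "when to update the cache" is the cache-aside pattern, sketched below with plain dicts standing in for Redis and the database (all names are illustrative, and this is one possible ordering, not the article's prescribed one):

```python
cache = {}                # stand-in for a distributed cache such as Redis
database = {"sku-1": 99}  # stand-in for the price table

def get_price(sku):
    """Cache-aside read: try the cache first, fall back to the DB and populate."""
    if sku in cache:
        return cache[sku]
    value = database[sku]
    cache[sku] = value
    return value

def update_price(sku, value):
    """Write the DB first, then invalidate the cache. If the invalidation
    step fails, the cache keeps serving stale data until it expires —
    exactly the consistency window the text warns about."""
    database[sku] = value
    cache.pop(sku, None)

print(get_price("sku-1"))  # 99, loaded from the DB and cached
update_price("sku-1", 89)
print(get_price("sku-1"))  # 89, cache was invalidated and reloaded
```

Invalidating rather than rewriting the cache on update avoids one class of race, but does not eliminate inconsistency; a TTL is usually added as a safety net.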

Database

Database optimization is very straightforward and effective.

Sorted by priority, the optimization methods are as follows:
● Indexes, redundancy, batch writes
● Reduce lock granularity
● Reduce complex queries
● Move transaction processing elsewhere where appropriate
● Improve hardware performance
● Read/write separation
● Database partitioning
● Vertical table splitting
● Horizontal table sharding
● Choose NoSQL where the business situation allows
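As an illustration of horizontal table sharding, the core is a routing function; the shard count and table names below are hypothetical:

```python
SHARD_COUNT = 4  # hypothetical number of physical order tables

def order_table(user_id):
    """Route all of one user's orders to the same table via modulo hashing,
    so per-user queries stay on a single shard."""
    return f"t_order_{user_id % SHARD_COUNT}"

print(order_table(7), order_table(12))  # t_order_3 t_order_0
```

Choosing the shard key is the hard part: hashing by user keeps a user's queries single-shard, but queries that cut across users then have to fan out to every table.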

3. Case Analysis

[Figure 3: price service with MQ and cache-based deduplication]

As shown in Figure 3, consider a price service in e-commerce. To improve write efficiency it can use message middleware, and to solve the problem of repeated submissions (especially when the system is unavailable and users resubmit frequently, creating a man-made traffic storm) it can deduplicate through the cache.

If one user submits a million price changes and another user then submits a single price modification, that single request will be blocked behind the million for a long time. The message middleware therefore needs a notion of priority; if priorities are not available, the problem can be solved by setting up multiple classified queues.
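The multiple-queue workaround can be sketched like this; the threshold, queue names, and consumer policy are invented for illustration:

```python
import queue

interactive_queue = queue.Queue()  # one-off edits
bulk_queue = queue.Queue()         # million-row batch imports
BULK_THRESHOLD = 100               # hypothetical cut-off

def submit(price_changes):
    """Classify by batch size so a huge import cannot block a small edit."""
    if len(price_changes) > BULK_THRESHOLD:
        bulk_queue.put(price_changes)
        return "bulk"
    interactive_queue.put(price_changes)
    return "interactive"

def next_job():
    """A crude priority scheme: drain interactive work before bulk work."""
    try:
        return interactive_queue.get_nowait()
    except queue.Empty:
        return bulk_queue.get_nowait()

submit([("sku-%d" % i, 10.0) for i in range(1000)])  # huge batch arrives first
submit([("sku-42", 12.0)])                           # then a one-off fix
print(next_job())  # the one-off fix is served first: [('sku-42', 12.0)]
```

A real broker with priority support achieves the same effect natively; the point is simply that large and small submissions must not share one FIFO line.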

If a user submits a million price revisions, spots a mistake in one of them, corrects it, and resubmits, the approach above may let the old version overwrite the new one. We solve this by introducing a version number.
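A minimal sketch of the version-number check, with a dict standing in for the price table (the field layout is illustrative):

```python
prices = {}  # sku -> (version, price): a stand-in for the price table

def apply_change(sku, price, version):
    """Accept a write only if its version is newer than the stored one,
    so a delayed or re-queued old message cannot overwrite a later fix."""
    current = prices.get(sku)
    if current is not None and version <= current[0]:
        return False  # stale update, dropped
    prices[sku] = (version, price)
    return True

apply_change("sku-1", 100.0, version=1)         # from the original bulk upload
apply_change("sku-1", 90.0, version=2)          # the user's correction
late = apply_change("sku-1", 100.0, version=1)  # old message arrives late
print(late, prices["sku-1"])  # False (2, 90.0)
```

This is the same idea as optimistic locking: writes carry a version, and out-of-order or duplicate messages become harmless no-ops instead of silent overwrites.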

 

4. Summary

● Not every part of the system needs high performance; balance it against code readability, maintainability, and architectural complexity;
● Before optimizing, identify the driving force behind the optimization;
● Deal correctly with the new problems that optimization introduces.

 

From November 9th to 12th at the Beijing National Convention Center, the 6th TOP100 Global Software Case Study Summit will be held. Huawei lean and agile expert Chen Jun will share "Huawei's 100-person Team Lean Kanban Evolution and Reform Road", and Huawei cloud computing test manager Li Chaofeng will share "Huawei Cloud Virtualization Quality Platform Construction Practice".

 

The TOP100 Global Software Case Study Summit has been held six times, selecting outstanding software R&D cases worldwide, with 2,000 attendees every year. It includes product, team, architecture, operations, big data, artificial intelligence, and other technical tracks, offering first-hand exposure to the latest R&D practices of leading Internet companies such as Google, Microsoft, Tencent, Alibaba, and Baidu.
