Summary of Common Performance Optimization Strategies

Xiaoming 2016-12-02 21:52
https://tech.meituan.com/performance_tunning.html

To start, I would like to thank one of the reviewers in my promotion evaluation. He suggested that I refine and summarize the various performance optimization cases and solutions I had worked on, capture them in written form, and share them internally, aiming for the following effects:

  1. Build up a set of practical, reusable performance optimization solutions and selection considerations, backed by concrete real-world cases, so that when others encounter similar problems they do not have to start from scratch.

  2. Help broaden horizons. Beyond performance optimization itself, it also provides general ideas and considerations for technology selection, helping everyone develop the awareness, thinking, and trade-off skills needed when choosing between solutions.

After the article was shared internally, it generated strong interest and was recognized and praised by many colleagues and friends, who felt it was a useful guide for daily work. Since these experiences may also be helpful to peers in the industry, it is now published on the Meituan Dianping technical team blog.

Common performance optimization strategy classification
Code
The reason code is listed first is that it is the aspect most easily overlooked by engineers. After receiving a performance optimization requirement, many engineers immediately reach for caching, asynchronous processing, JVM tuning, and so on. In fact, the first step should be to analyze the relevant code, find the corresponding bottleneck, and only then consider specific optimization strategies. Some performance problems are caused entirely by unreasonable code and can be solved simply by modifying the code: for example, excessive for loops, many unnecessary conditional checks, or the same logic repeated over and over.

Database
Database tuning is generally divided into the following three parts:

SQL tuning
This is the most commonly used, and every engineer should master basic SQL tuning methods (including methods, tools, auxiliary systems, etc.). Taking MySQL as an example, the most common approach is to locate the specific problematic SQL using the built-in slow query log or an open-source slow-query system, then tune it step by step with tools such as explain and profile, verify the effect in testing, and finally release the change online. For details, see MySQL Index Principles and Slow Query Optimization.

Architecture-level tuning
This type of tuning includes read-write separation, load balancing across multiple slaves, and horizontal and vertical sharding of databases and tables. It generally involves large changes, is needed less frequently than SQL tuning, and usually requires the cooperation of a DBA. So when do these things need to be done? We can use an internal monitoring and alerting system (such as Zabbix) to regularly track whether certain metrics are approaching their bottleneck; once a bottleneck or warning threshold is reached, these measures need to be considered. Typically, DBAs also monitor these metrics regularly.

Connection pool tuning
To acquire database connections efficiently and to limit the number of concurrent database connections, our applications usually adopt a connection pool scheme: each application node manages a pool of connections to each database. As business traffic or data volume grows, the original pool parameters may no longer fit well. At that point, a comprehensive judgment must be made based on the principles of the connection pool in use, the pool's monitoring data, and the current business volume, and the final tuning parameters are reached through several rounds of debugging.
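The article does not name a specific pool implementation; as one hedged illustration, a HikariCP-style configuration might look like the sketch below. The JDBC URL, credentials, and every numeric value are purely illustrative and would have to come from the pool monitoring data and business volume described above.

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    public final class DataSourceFactory {
        public static HikariDataSource create() {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl("jdbc:mysql://db-host:3306/app");   // illustrative URL
            config.setUsername("app");
            config.setPassword("secret");
            config.setMaximumPoolSize(20);       // upper bound doubles as a connection limiter
            config.setMinimumIdle(5);            // connections kept warm for fast acquisition
            config.setConnectionTimeout(3_000);  // fail fast (ms) instead of queueing forever
            return new HikariDataSource(config);
        }
    }

The bounded maximum pool size is what gives the pool its connection-limiting effect, while the connection timeout makes callers fail fast instead of piling up behind an exhausted pool.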

Cache
Classification
Local cache (HashMap/ConcurrentHashMap, Ehcache, Guava Cache, etc.), cache service (Redis/Tair/Memcache, etc.).

Usage Scenarios
When is the cache suitable for use? Consider the following two scenarios:

If the same data is queried many times within a short period and the data is not updated frequently, you can query the cache first and, on a cache miss, load the data from the database and set it back into the cache. This scenario is better suited to a single-machine (local) cache.
For highly concurrent queries of hot data where the back-end database cannot take the load, a cache can be used to absorb the traffic.
Selection Considerations
If the amount of data is small and is not frequently grown and cleared (which would lead to frequent garbage collection), a local cache can be chosen. Specifically, if you need support for certain policies (such as an eviction policy when the cache is full), consider Ehcache; if not, a HashMap is enough; for multi-threaded concurrent access, consider ConcurrentHashMap (see the sketch after this list).
In other cases, a cache service can be considered. At present, we give priority to Tair in terms of resource investment, operability, dynamic scalability, and supporting facilities; only where Tair cannot currently meet the need (such as distributed locks or Hash-typed values) do we consider Redis.
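As an illustration of the local-cache option above, here is a minimal sketch using Guava Cache (one of the libraries mentioned); the Poi type and the loadPoiFromDb call are hypothetical placeholders for the real domain object and DAO call:

    import java.util.concurrent.TimeUnit;
    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;

    public class PoiLocalCache {
        // Bounded size plus expiration, matching the selection considerations above.
        private final LoadingCache<Long, Poi> cache = CacheBuilder.newBuilder()
                .maximumSize(10_000)                      // evict when the cache is full
                .expireAfterWrite(5, TimeUnit.MINUTES)    // refresh stale entries from the DB
                .build(new CacheLoader<Long, Poi>() {
                    @Override
                    public Poi load(Long id) {
                        return loadPoiFromDb(id);         // on a miss, fall back to the database
                    }
                });

        public Poi get(long id) {
            return cache.getUnchecked(id);
        }

        private Poi loadPoiFromDb(long id) {
            return new Poi();                             // placeholder for the real DAO call
        }

        static class Poi { }                              // hypothetical domain object
    }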
Design key points
When should the cache be updated? How do we ensure that updates are reliable and timely?
The strategy for updating the cache must be analyzed case by case. Here, the cached store POI data is used as an example to illustrate the update strategy for a cache-service style cache. Currently, about 100,000 POI records use Tair as the cache service, with two specific update strategies:

Receive store-change messages and update the cache in near real time.
Set a 5-minute expiration time on each POI cache entry; after expiration, load the data from the DB and set it back into the cache. This strategy is a strong supplement to the first one: it covers the cases where the first strategy fails because the DB was changed manually without sending a message, or the update program hit a temporary error while handling a message. Through this double-insurance mechanism, the reliability and timeliness of the POI cache data are effectively guaranteed.
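A minimal sketch of the two strategies above, using Jedis as a stand-in for the Tair client actually used in the article; the key format, the loadPoiFromDb call, and the message-handler signature are all assumptions for illustration:

    import redis.clients.jedis.Jedis;

    public class PoiCacheUpdater {
        private static final int TTL_SECONDS = 5 * 60;             // 5-minute fallback expiry
        private final Jedis cache = new Jedis("cache-host", 6379); // stand-in for the Tair client

        // Strategy 1: near-real-time update driven by store-change messages.
        public void onPoiChanged(long poiId, String poiJson) {
            cache.setex("poi:" + poiId, TTL_SECONDS, poiJson);
        }

        // Strategy 2: readers reload from the DB once the key has expired (or was never set).
        public String get(long poiId) {
            String value = cache.get("poi:" + poiId);
            if (value == null) {
                value = loadPoiFromDb(poiId);                      // assumed DAO call
                cache.setex("poi:" + poiId, TTL_SECONDS, value);   // reset the cache with a fresh TTL
            }
            return value;
        }

        private String loadPoiFromDb(long poiId) {
            return "{}";                                           // placeholder
        }
    }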
Will the cache be full? What should I do if the cache is full?
For a cache service, in theory, as the amount of cached data keeps growing, a cache with limited capacity will eventually fill up. How should we respond?
① For the cache service, choose an appropriate cache eviction algorithm, such as the most common LRU (a minimal sketch follows this list).
② For the configured capacity, set an appropriate warning threshold: for example, for a 10G cache, raise an alarm when the cached data reaches 8G, so the problem can be investigated or capacity expanded in advance.
③ For keys that do not need to be kept for a long time, set an expiration time whenever possible.
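For point ①, here is a minimal sketch of an LRU eviction policy built on java.util.LinkedHashMap's access-order mode; it illustrates the algorithm itself rather than how Tair or Ehcache implement it, and the capacity value is up to the caller:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        public LruCache(int capacity) {
            super(16, 0.75f, true);   // accessOrder = true keeps entries in LRU order
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity; // evict the least recently used entry once the cache is full
        }
    }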

Can cached data be lost? What if it is lost?
Whether loss is allowed depends on the business scenario. If it is not allowed, a cache service with persistence support is needed, such as Redis or Tair. In more detail, a specific persistence strategy, such as Redis's RDB or AOF, can be chosen according to how much data loss (in time) the business can tolerate.

The problem of cache "breakdown"
For keys with an expiration time set, if they may be accessed with extremely high concurrency at certain points in time, they are very "hot" data. In this case another problem needs to be considered: cache "breakdown".

Concept: when the cache expires at a certain point in time, there happen to be a large number of concurrent requests for this key at exactly that moment. Seeing that the cache has expired, these requests generally go to load the data from the back-end DB and set it back into the cache, and under heavy concurrency they can instantly overwhelm the back-end DB.
How to solve it: a common industry practice is to use a mutex. Simply put, when the cache misses (the value read back is empty), do not load from the DB immediately; first use a cache operation that atomically reports success or failure (such as Redis's SETNX or Memcache's ADD) to set a mutex key. If that operation succeeds, perform the DB load and reset the cache; otherwise, retry the whole get-cache method. Similar to the following code:

public String get(String key) {
    // Illustrative code: redis, db, key_mutex, expire_secs and sleep are assumed
    // to be defined elsewhere; the client calls are pseudocode, not a specific API.
    String value = redis.get(key);
    if (value == null) { // the cached value has expired (or was never set)
        // Give the mutex key a 3-minute timeout so that a failed del cannot
        // permanently block subsequent loads from the DB.
        if (redis.setnx(key_mutex, 1, 3 * 60) == 1) { // we won the mutex
            value = db.get(key);                // only the winner hits the DB
            redis.set(key, value, expire_secs); // reset the cache
            redis.del(key_mutex);
            return value;
        } else {
            // Another thread is already loading from the DB and resetting the
            // cache; back off briefly, then retry the whole get.
            sleep(50);
            return get(key); // retry
        }
    } else {
        return value;
    }
}
Asynchronous
Usage scenarios
For some client requests, the server may need to do some auxiliary work that the user either does not care about or does not need to see immediately. In such cases it is more appropriate to handle that work asynchronously.

Benefits
Shorten interface response time, so the user's request returns quickly and the experience is better.
Avoid long-running threads occupying the service thread pool: otherwise the pool runs short of available threads for long stretches, the task queue grows, more request tasks get blocked, and even more requests cannot be handled in time.
Long-running threads may also cause problems such as rising system load and CPU usage and overall machine performance degradation, and can even trigger an avalanche. Asynchronous processing can effectively mitigate this without adding machines or CPUs.
Common practice
One approach is to spawn extra threads: use an additional thread or a thread pool to handle the extra work on a thread other than the IO thread (which handles the request and response), so the response can be returned from the IO thread first.

If the volume of data handled by the asynchronous threads is very large, a blocking queue (BlockingQueue) can be introduced for further optimization. The specific approach is to have a group of asynchronous threads continuously push data into the blocking queue, while an additional processing thread takes data from the queue in batches of a preset size and processes them together (for example, sending one batched remote service request), which improves performance further.
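A minimal sketch of this queue-plus-batching pattern, assuming String tasks and a placeholder sendBatch method standing in for the real batched remote call; the queue capacity, batch size, and poll timeout are illustrative only:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class AsyncBatchProcessor {
        private static final int BATCH_SIZE = 100;
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);

        // Called from request-handling (IO) threads: enqueue and return immediately.
        public boolean submit(String task) {
            return queue.offer(task);
        }

        // A dedicated thread drains the queue in batches and processes them together.
        public void start() {
            Thread drainer = new Thread(() -> {
                List<String> batch = new ArrayList<>(BATCH_SIZE);
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        String first = queue.poll(200, TimeUnit.MILLISECONDS); // wait for work
                        if (first == null) continue;
                        batch.add(first);
                        queue.drainTo(batch, BATCH_SIZE - 1); // take up to a full batch without blocking
                        sendBatch(batch);                     // e.g. one batched remote call
                        batch.clear();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }, "async-batch-drainer");
            drainer.setDaemon(true);
            drainer.start();
        }

        private void sendBatch(List<String> batch) {
            // placeholder for the real batched downstream call
        }
    }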

Another approach is to use message queue (MQ) middleware, which is asynchronous by nature. Some of the extra work may not even need to be handled by my system but by other systems. In that case, the work can be encapsulated into a message and placed on the message queue; the reliability of the message middleware guarantees delivery to the systems that care about it, and those systems then do the corresponding processing.

For example, after the C-side finishes placing an order, other systems may need to do a series of things, but the results of those things will not immediately affect the C-side user. In that case, the response to the C-side order request can be returned to the user first, with a message sent to MQ just before returning. Moreover, those things should not be the responsibility of the C-side service, so handling them via MQ is the most appropriate solution.

NoSQL
Difference from caching
First, a note: this section is different from the caching section above. Although the same data storage products (such as Redis or Tair) may be used, the way they are used here is different: they are used as a DB. When used as a DB, the availability and reliability of the data store must be effectively guaranteed.

Usage scenarios
It depends on the specific business scenario: whether the data involved is suitable for NoSQL storage, whether the data access patterns suit NoSQL operations, and whether some extra NoSQL features (such as atomic increment and decrement) are needed.

If the business data does not need to be joined with other data, does not require transactions or foreign-key support, and may be written very frequently, then NoSQL (such as HBase) is a better fit.

For example, Meituan Dianping has a monitoring system for exceptions: if a serious failure occurs in an application system, a large amount of exception data may be produced in a short time. Using MySQL here would cause an instantaneous spike in MySQL write pressure, easily leading to problems such as a sharp drop in MySQL server performance and master-slave replication delay. This scenario is better suited to NoSQL storage such as HBase.
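As a hedged sketch of such a write path using the standard HBase client API; the table name, column family, and row-key scheme below are assumptions for illustration, not the actual design of the monitoring system:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ExceptionWriter {
        // In practice the Connection would be created once and shared; it is built
        // inline here only to keep the sketch self-contained.
        public void write(String appName, String hostName, String stackTrace) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("app_exception"))) {
                // appName prefix groups rows per application; the reversed timestamp keeps
                // the most recent exceptions at the front of that application's rows.
                String rowKey = appName + "#" + (Long.MAX_VALUE - System.currentTimeMillis());
                Put put = new Put(Bytes.toBytes(rowKey));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("host"), Bytes.toBytes(hostName));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("stack"), Bytes.toBytes(stackTrace));
                table.put(put);
            }
        }
    }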

JVM tuning
When to tune?
Through a monitoring system (if there is no ready-made one, a simple reporting and monitoring system is easy to build yourself) that alerts on key machine metrics (GC time, GC count, memory size changes of each generation, machine load and CPU usage, number of JVM threads, etc.), combined with the output of the GC log and commands such as jstat, and with the performance data and perceived latency of the key interfaces served by the JVM process, you can basically determine whether the current JVM has a problem and whether tuning is needed.

How to adjust?
If you find that CPU usage and load are too high during peak periods, look at the JVM thread count and GC counts at that time (probably mainly the young GC count) and compare them against historical or empirical values; if they are clearly elevated, you can basically conclude that young GC is happening too frequently. This can be addressed by appropriately increasing the size or proportion of the young generation.
If you find that key interfaces respond very slowly, combine the GC time with the stop-the-world time in the GC log to see whether the application's total stop-the-world time is long. If so, the total GC time probably needs to be reduced. This can be approached from two dimensions: reducing the number of GCs and reducing the duration of a single GC. These two are generally a pair of mutually constraining factors, so the relevant parameters (such as the ratio of the young generation to the old generation, the ratio of eden to survivor spaces, the MTT (MaxTenuringThreshold) value, the old-generation occupancy threshold that triggers a CMS collection, etc.) need to be adjusted against actual monitoring data to reach an optimum.
If full GC, or old-generation CMS GC, occurs very frequently, this usually drives up the stop-the-world time and in turn slows down interface responses; in this situation there is a high probability of a "memory leak". In Java, a memory leak means objects that should have been released are not released (some reference is still holding on to them). How were these objects created? Why are they not released? Is there a problem in the corresponding code? The key to the problem is to understand this, find the corresponding code, and apply the right fix. So the crux becomes finding these objects. How? By using jmap and MAT together, you can usually locate the specific code.
Multi-threading and distribution
Usage scenarios
Offline tasks, asynchronous tasks, big-data tasks, and long-running tasks: used appropriately, they can achieve acceleration.

Note: for online, latency-sensitive services, use multi-threading sparingly, especially when the service thread needs to wait for task threads (many major incidents are closely related to this). If you must use it, set a maximum wait time for the service thread.

Common practice
If the processing capability of a single machine can meet the needs of the actual business, then use the single-machine multi-threaded processing method as much as possible to reduce the complexity; otherwise, it is necessary to use the multi-machine multi-threaded method.

For single-machine multi-threading, a thread pool mechanism can be introduced, which serves two purposes:

Improve performance by saving the overhead of thread creation and destruction.
Limit the rate: give the thread pool a fixed capacity; once this capacity is reached, new tasks enter the queue and wait, which keeps the machine's processing capacity stable under extreme pressure. When using a thread pool, be sure to understand the meaning of each constructor parameter, such as core pool size, max pool size, keepAliveTime, and the worker queue, and adjust these parameters through continuous testing to achieve the best results (a minimal sketch follows).
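A minimal sketch of constructing such a pool with java.util.concurrent.ThreadPoolExecutor; all parameter values are illustrative and, as noted above, must be tuned through testing:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class WorkerPool {
        // Parameter values here are illustrative; tune them against real load tests.
        private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
                8,                                   // corePoolSize: threads kept even when idle
                16,                                  // maximumPoolSize: upper bound under load
                60L, TimeUnit.SECONDS,               // keepAliveTime for threads above the core size
                new ArrayBlockingQueue<>(1_000),     // bounded worker queue, which also limits the rate
                new ThreadPoolExecutor.CallerRunsPolicy()); // back-pressure once the queue is full

        public void submit(Runnable task) {
            pool.execute(task);
        }
    }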
If a single machine's processing power cannot meet the requirements, a multi-machine, multi-threaded approach is needed, which requires some distributed-systems knowledge. First, a separate node is introduced as the scheduler, and the other machine nodes act as executors. The scheduler is responsible for splitting tasks and dispatching them to suitable executor nodes; the executor nodes run the tasks with multiple threads (or possibly a single thread). At this point the whole task system has evolved from a single machine into a cluster, where different nodes play different roles, each does its own job, and they interact with each other. Besides multi-threading and thread pools, mechanisms such as RPC, heartbeats, and other network communication become indispensable. Later, I will publish a simple distributed scheduling framework based on this.

Measurement system (monitoring, alarming, service dependency management)
Strictly speaking, the measurement system does not belong to performance optimization itself, but it is closely related: it provides strong data reference and support for performance optimization. Without a measurement system, there is basically no way to locate problems in the system, nor any way to effectively measure the effect of an optimization. Many people ignore this aspect, but I consider it the cornerstone of system stability and performance.

Key Processes
If such a measurement system is to be designed, what key processes generally need to be covered?
① Determine indicators
② Collect data
③ Calculate data and store results
④ Display and analyze

Which metrics need monitoring and alerting? Which ones deserve attention?
Based on our needs, two main groups of metrics are required:

Interface performance metrics, including QPS, response time, and call volume for individual interfaces and in aggregate (the finer the time granularity of the statistics, the better; ideally the data can be viewed per node or per service cluster). This also involves managing service dependencies, for which a service dependency management system is needed.
Metrics for individual machine nodes, including CPU usage, load, memory usage, network card traffic, etc. If the node runs some special type of service (such as MySQL, Redis, Tair), key metrics specific to those services can also be monitored.
Data collection
Asynchronous reporting is usually adopted. There are two specific approaches: the first is to send the data to a local Flume port, and let the Flume process collect it into a remote Hadoop or Storm cluster for computation; the second is to finish the computation locally and then send the results to the monitoring server asynchronously through a local queue.

Data computation
Computation can be done offline (MapReduce/Hive) or in real time / near real time (Storm/Spark), with the results stored in MySQL or HBase; in some cases, the collected data can also be sent directly to the monitoring server without computation.

Display and analysis
Provide a unified analysis platform with reports (lists/charts) as well as monitoring and alerting functions.

Real case studies
Case 1: The refresh job between merchants and control areas
Background
This is a job that runs regularly every hour to refresh the relationship between merchants and control areas. The specific rule is: if any of a merchant's delivery ranges (there can be several) intersects with a control area, the merchant is placed within the scope of that control area.

Business requirement
The business requires this process to be as short as possible, preferably finishing within 20 minutes.

Optimization process
The main processing flow of the original code was:

1. Get the list of delivery ranges of all merchants and the list of all control areas.
2. Traverse the list of control areas, and for each control area:
a. Traverse the merchants' delivery ranges and find those that intersect with this control area.
b. Traverse that list of intersecting delivery ranges, de-duplicate the merchant IDs in it, and save them into a set.
c. Fetch the corresponding merchants in batches according to the merchant ID set above.
d. Traverse the merchant set, take each merchant object, and process it accordingly (decide, based on conditions such as whether it is a popular merchant, self-operated, or supports online payment, whether the existing relationship between the merchant and the control area needs to be inserted or updated).
e. Delete the merchant relationships that currently exist for this control area but should no longer exist.
Analyzing the code, steps a and b of step 2 (finding the set of delivery ranges that intersect with a given control area and de-duplicating the merchant IDs) can be optimized with an R-tree spatial index. The specific method is as follows (see the sketch after this list):

At the start of the task, update the R-tree, and then use the R-tree structure and matching algorithm to obtain the list of delivery-range IDs that intersect with the control area.
Then, according to that list of IDs, fetch the delivery ranges in batches.
Then, for this (now very small) batch of delivery ranges, use the original polygon intersection matching for further filtering, and de-duplicate the merchant IDs of the ranges that pass the filter.
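A hedged sketch of this approach using the STRtree R-tree implementation from the JTS Topology Suite (the article does not name a library); DeliveryRange is an assumed domain type, and the sketch collapses the ID-batching step by storing the ranges in the index directly:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import org.locationtech.jts.geom.Geometry;
    import org.locationtech.jts.index.strtree.STRtree;

    public class DeliveryRangeIndex {
        // Assumed domain type: a merchant delivery range with its polygon.
        public interface DeliveryRange {
            Geometry getPolygon();
            long getMerchantId();
        }

        private final STRtree index = new STRtree();

        // Step 1 (build/update): index every delivery-range polygon by its bounding box.
        public void rebuild(List<DeliveryRange> allRanges) {
            for (DeliveryRange range : allRanges) {
                index.insert(range.getPolygon().getEnvelopeInternal(), range);
            }
        }

        // Steps 2-3: coarse filter by bounding box, then exact polygon intersection,
        // de-duplicating merchant IDs via a Set.
        @SuppressWarnings("unchecked")
        public Set<Long> intersectingMerchantIds(Geometry controlAreaPolygon) {
            List<DeliveryRange> candidates = index.query(controlAreaPolygon.getEnvelopeInternal());
            Set<Long> merchantIds = new HashSet<>();
            for (DeliveryRange candidate : candidates) {
                if (controlAreaPolygon.intersects(candidate.getPolygon())) {
                    merchantIds.add(candidate.getMerchantId());
                }
            }
            return merchantIds;
        }
    }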
This first-phase optimization went online, and the overall running time was shortened from more than 40 minutes to less than 20 minutes.

After the first phase switched to the R-tree and had run for a period of time, performance began to deteriorate again as the data volume grew, degrading to more than 50 minutes after a month. So I analyzed the code in depth again, found two more optimization points, and arranged and released a second phase of optimization.

The two optimization points are:

In step c of step 2, the merchants, previously fetched in batches from the DB by the merchant ID list, can now be fetched in batches from the cache with mget (the merchant data has been cached by this time).
In step d of step 2, the logic that judges, based on conditions such as popular merchant, self-operated, and online payment, whether the existing relationship between the merchant and the control area needs to be inserted or updated.
Effect after going online
From the logs, the execution time was shortened from more than 50 minutes to less than 15 minutes. The figure below shows one day of log timings (in milliseconds) from 4 machines:
[Figure: POI optimization effect]
As can be seen, the effect is very obvious.

Case 2: POI Cache Design and Implementation
Background
In Q4 2014, read traffic in the database for POI data (a POI can be simply understood as a takeaway store) increased sharply. Although adding slave nodes can relieve part of the problem, the number of slaves cannot grow without limit; as it approaches the limit, master-slave replication becomes the bottleneck and may cause data inconsistency. Therefore, it was urgent to introduce a new technical solution to share the database load and reduce the read traffic for POI data. Moreover, adding DB slaves for every scenario causes a certain waste of resources.

Implementation scheme
Based on existing, well-proven technology, I chose Tair as the cache storage to help the DB absorb the read traffic for POI data from each application. The main considerations were availability, performance, scalability, whether it had withstood the test of large-scale data and high-concurrency traffic online, whether there was a professional operations team, and whether mature tooling existed.

Detailed design
First version
For the cache update strategy, based on the business characteristics, the existing technology, and the implementation cost, we chose to use MQ to receive POI change messages and trigger cache updates; however, this path can fail. At the same time, a key expiration policy is enabled: the caller first checks whether the entry has expired and, if so, loads the data from the back-end DB, sets it back into the cache, and then returns. These two mechanisms together provide double insurance for the availability of the cached data.

Second version
After running the first version for a period of time, we found two problems:

In some cases the real-time consistency of the data could not be guaranteed (for example, an engineer manually changed the DB data, or the MQ-based cache update failed). In those cases we could only wait out the 5-minute expiration, which some business scenarios cannot accept.
Adding the expiration time introduced another problem: at the instant of a cache miss, Tair tries to load the data from disk, and only goes to the DB if the data is not on disk either. This undoubtedly lengthens Tair's response time, which not only increases the business's timeout rate but also further degrades Tair's performance.
To solve these problems, we learned from the colleagues responsible for infrastructure at Meituan Dianping that Databus could resolve the cache inconsistency in these cases and allow us to remove the expiration mechanism, thereby improving query efficiency and avoiding disk lookups inside Tair on memory misses. To prevent a single point of failure in Databus from affecting our business, we kept the previous MQ-based cache update path as a fault-tolerant fallback behind a switch. The overall architecture is as follows:
[Figure: POI cache design]

Effect after going online
Through continuous monitoring of the data, we found that as the call volume grew, the traffic to the DB dropped significantly, greatly reducing the pressure on the DB, and the response time of these data interfaces dropped significantly as well. The double-insurance update mechanism also largely guaranteed the availability of the cached data. See the figures below:
[Figure: POI cache optimization effect 1]
[Figure: POI cache optimization effect 2]

Case 3: Performance optimization of related pages in the background of business operations
Background
With the rapid development of the business, traffic and data volume increased sharply. Through our monitoring system we found that the performance of some pages in the system had begun to deteriorate, and feedback from users confirmed it. At that point it was necessary to schedule quickly, develop in an agile way, and tune these pages.

Welcome page
Requirement background: The welcome page is the home page that ground-promotion staff, and headquarters staff in various roles, see when they enter the takeaway operations back office. It displays the core data that ground-promotion staff most want to see and care about, so its importance is self-evident, and performance degradation of this page seriously hurts the user experience. Therefore, the welcome page was the first thing to optimize. Through analysis and profiling, two main causes of the degradation were found: the data interface layer and the computation/presentation layer.
Solution: Prescribe the right remedy and divide and conquer. After careful investigation, analysis, and pinpointing, the data interface layer was effectively optimized with batched interface calls and asynchronous RPC calls; for the computation/presentation layer, we decided to pre-compute the results and cache them to speed up queries. For this cache, Redis was chosen according to the business scenario and technical characteristics. Once the plan was settled, it was developed quickly and released.
Online effect: The performance comparison chart after going online is as follows:
[Figure: Optimization effect chart 1]
Organization structure page
Requirement background: The organization structure page uses a four-level tree, displayed and loaded all at once. After the first version went online, its performance was found to be very poor, and users were eager to have this page tuned.
Solution: After analyzing the code, a fairly classic problem was located: too many SQL queries, each returning a small amount of data, were being executed. So multiple SQLs were merged into one large SQL, and a local cache was then used to cache the data; the data volume and performance were estimated carefully, and the change went online after thorough testing.
Online effect: The performance comparison chart after going online is as follows:
[Figure: Optimization effect chart 2]
Order-related building page
Requirement background: As the number of orders grows, the data accumulated in the order table keeps increasing, and the performance of the order-related building page keeps getting worse (response time grows linearly). This page is closely tied to the performance of the ground-promotion staff, who therefore use it very frequently, so the degradation greatly hurt their experience.
Solution: After analysis and design, we decided to use the existing monthly-partitioned secondary-index order tables, instead of the original order table, to serve the front-end query requests, and to constrain the time filter so that the start and end times cannot span months (this was communicated with users in advance, is acceptable to them, and meets their basic needs). In this way, only one month's secondary-index table needs to be queried, achieving the tuning through an appropriate functional restriction. From the monthly secondary-index table, the final paged set of order IDs is found according to the query conditions, and the corresponding order data is then fetched from the order database by those IDs.
Online effect: After going online, performance improved significantly while the call volume barely changed.
There are of course many other optimization techniques, such as indexes and spatial indexes; due to space limitations, they are left for future articles.

