Common performance tuning strategies and their application in risk control scenarios

Introduction

Performance tuning follows recognizable patterns. This article organizes the general optimization strategies accumulated through real development work and pairs each with its use in a risk control system, helping readers understand the feasible approaches to performance tuning and build a performance-optimization SOP that future problems can be worked through against.

Performance Optimization Strategies

Space-Time Trade-offs

Anyone who has worked through algorithm problems knows the two scoring criteria: time complexity and space complexity. The less an algorithm consumes of both, the higher it scores and the better it is.

In real-world development, however, you generally cannot have both: either occupy less space and run longer, or pursue speed and occupy more space. What we do is optimize one of the two for a specific scenario, and that usually means sacrificing the other.

Space for Time

Some business scenarios pursue extreme performance: the response must be fast, such as the first-screen load of a page. If every H5 page open had to fetch its rendering data from the origin server, latency would also vary with user location; a user in New York reaching a server in Beijing is certainly much slower than a local Beijing user. This is what a **CDN** solves: a provider's CDN nodes span the globe, and once a page is built it is pushed to CDN nodes worldwide for caching, so each user is served from the nearest node. Access speeds up naturally, at the cost of extra storage.

Risk control has similar cases; for top performance, the decision engine also trades space for time (a minimal cache sketch follows the list):

  • Configuration data cache: the data associations in a decision flow are complex. Interacting with the DB or a data service on every evaluation costs too much I/O for the benefit gained; a local cache plus an update trigger is the better choice.
  • Short-term cache: for hot data that does not change for a while (or whose staleness can be tolerated for a while), reading from the cache is far faster than re-querying.
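
A minimal sketch of the local-cache-plus-update-trigger idea in Java. The class and method names are illustrative, not taken from any specific decision engine:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Local cache with a time-to-live: memory is spent to avoid repeated DB/data-service I/O.
public class TtlCache<K, V> {
    private record Entry<T>(T value, long expiresAtMillis) {}

    private final Map<K, Entry<V>> store = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public V get(K key, Function<K, V> loader) {
        Entry<V> e = store.get(key);
        long now = System.currentTimeMillis();
        if (e == null || e.expiresAtMillis() < now) {
            V fresh = loader.apply(key);                      // fall back to DB / data service
            store.put(key, new Entry<>(fresh, now + ttlMillis));
            return fresh;
        }
        return e.value();
    }

    // "Update trigger": call this when the underlying configuration changes.
    public void invalidate(K key) { store.remove(key); }
}
```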

Time for Space

This strategy does the opposite: it trades time to save space. Here space is the scarcer resource. Memory, for instance, is more precious than disk, so data can be kept on disk and loaded back when needed, paying some extra time on each access.

Risk control offers plenty of examples here as well (a compression sketch follows the list):

  • Saving infrastructure cost: server resources "on the cloud" are more expensive than an off-cloud self-built IDC machine room, so risk control keeps large volumes of non-real-time data off-cloud for computation. When a cloud service needs the result, it pays a time cost: a cross-machine-room call (competing for dedicated-line bandwidth) consumes an extra 5~10 ms.
  • Compressing device-fingerprint uploads: risk control relies on device fingerprints, collected by an SDK embedded in the app. Uploading large amounts of device information uncompressed would consume a lot of bandwidth and interfere with normal user requests. Compression saves the space, but the price is extra compression/decompression time on every upload.
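
A sketch of the compression trade using the JDK's standard GZIP support; the payload contents are made up for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Time-for-space: spend CPU time gzipping the fingerprint payload to shrink
// the bytes sent over the wire.
public final class FingerprintUpload {
    static byte[] gzip(String json) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(json.getBytes(StandardCharsets.UTF_8)); // CPU time spent here...
        }
        return buf.toByteArray();                            // ...to save bandwidth
    }

    public static void main(String[] args) throws IOException {
        String payload = "{\"os\":\"android\",\"model\":\"Pixel 8\"}".repeat(100);
        byte[] compressed = gzip(payload);
        System.out.printf("raw=%d bytes, gzipped=%d bytes%n",
                payload.getBytes(StandardCharsets.UTF_8).length, compressed.length);
    }
}
```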

Preprocessing/Postponement

Preprocessing

Preprocessing is mainly about speed. CPU prefetching, for example, loads instructions and data from memory into the cache ahead of time, thereby speeding up execution.

To keep policy-execution RT within 200 ms, the decision engine has to compress policy execution time. Suppose some policy, no matter how well optimized, still exceeds 200 ms; the remedy is to trigger the expensive part ahead of time, on the preceding event (scenario).

For example: when a user submits an order, the engine must assess group (gang) risk, but the group lookup is time-consuming. So the lookup can be triggered when the user enters the order page, before the order is actually submitted; when the order does arrive, the engine simply reads the precomputed result. Similarly, certain operations can be triggered at login to aid risk perception in subsequent events.
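
A sketch of this event-driven prefetch in Java with CompletableFuture; the event hooks and the GroupRiskPrefetcher name are hypothetical stand-ins for whatever the decision engine exposes:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Start the slow group-risk query on an earlier event; read the result later.
public class GroupRiskPrefetcher {
    private final Map<String, CompletableFuture<Double>> pending = new ConcurrentHashMap<>();

    // Fired when the user opens the order page: kick off the slow query now.
    public void onEnterOrderPage(String userId) {
        pending.computeIfAbsent(userId,
                id -> CompletableFuture.supplyAsync(() -> slowGroupRiskQuery(id)));
    }

    // Fired when the order is actually submitted: the result is (usually) ready.
    public double onSubmitOrder(String userId) {
        CompletableFuture<Double> f = pending.getOrDefault(userId,
                CompletableFuture.supplyAsync(() -> slowGroupRiskQuery(userId)));
        return f.join(); // near-instant if the prefetch already finished
    }

    private double slowGroupRiskQuery(String userId) {
        return 0.42; // placeholder for a multi-hundred-ms group/graph lookup
    }
}
```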

Postponement

Resolutely put off work until it is truly necessary, to save cost. The most famous use of this strategy is COW (Copy-On-Write). Suppose multiple threads want to operate on a piece of data. Normally each thread could copy it into its own space, but copying is time-consuming. With lazy processing the copy is postponed: as long as the threads only read the data, they can all share the same instance, since reads do not change it. Only when a thread needs to modify the data (a write) does the system copy the resource for that thread to rewrite, leaving the other threads unaffected.
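
The JDK ships this strategy directly; a minimal usage sketch with java.util.concurrent.CopyOnWriteArrayList:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// CopyOnWriteArrayList: reads share the same underlying array with no locking;
// every mutation copies the array. Cheap when reads vastly outnumber writes.
public class CowDemo {
    public static void main(String[] args) {
        List<String> rules = new CopyOnWriteArrayList<>(List.of("rule-a", "rule-b"));

        // Many reader threads can iterate concurrently over a stable snapshot.
        for (String r : rules) {
            System.out.println("evaluating " + r);
        }

        // A writer pays the copy cost; readers mid-iteration keep their snapshot.
        rules.add("rule-c");
    }
}
```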

In risk control, delayed execution is mainly about saving cost. For raw speed, the decision engine loads many variables (also called features or indicators) in parallel up front, but some of them are paid third-party lookups, such as IP intelligence, Tongdun, or AntDun. Preloading those has obvious speed benefits, yet it greatly inflates cost: a user may be rejected, or passed via a whitelist, before the flow ever reaches the decision node that uses the paid variable, so requesting the third party in advance is pure waste for those users. Issuing the request only when the paying strategy node is actually reached keeps the cost minimal. A lazy-loading sketch follows.
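
A minimal memoized lazy-supplier sketch in Java; the variable name and the third-party call are hypothetical. (The synchronized get is deliberately simple; a production version might use double-checked locking or a per-variable future.)

```java
import java.util.function.Supplier;

// Lazy, memoized variable loading: the paid third-party call happens only if
// a decision node actually reads the variable.
public class LazyVariable<T> implements Supplier<T> {
    private final Supplier<T> loader;
    private T value;
    private boolean loaded;

    public LazyVariable(Supplier<T> loader) { this.loader = loader; }

    @Override
    public synchronized T get() {
        if (!loaded) {            // first real read pays for the call
            value = loader.get();
            loaded = true;
        }
        return value;
    }

    public static void main(String[] args) {
        LazyVariable<String> ipRisk =
                new LazyVariable<>(() -> "billed third-party IP lookup");
        // If the flow short-circuits (reject/whitelist) before this line,
        // the third party is never called and never billed.
        System.out.println(ipRisk.get());
    }
}
```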

Parallel/Asynchronous Operations

Parallel

If one worker cannot finish the job, bring in a few more to do it together! Parallel execution processes more per unit time (provided the machine has multiple cores) and greatly shortens RT. The vast majority of Internet servers use multiple processes or threads to handle user requests precisely to exploit multi-core CPUs. Parallelism also fits wherever I/O blocks: the CPU is mostly idle while waiting, so multiple threads let it do more useful work.

If the decision engine executed its strategies one after another, a single evaluation might not finish in minutes. Running them in parallel exploits every CPU core, and the performance bottleneck becomes the longest-running strategy, the "longest board", which is then the one worth optimizing. A sketch follows.
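
A sketch of fanning out independent strategies with CompletableFuture; the strategy names are illustrative:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Running independent strategies in parallel: total RT is bounded by the
// slowest strategy rather than the sum of all of them.
public class ParallelStrategies {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        List<CompletableFuture<String>> futures = List.of(
                CompletableFuture.supplyAsync(() -> runStrategy("device-check"), pool),
                CompletableFuture.supplyAsync(() -> runStrategy("ip-reputation"), pool),
                CompletableFuture.supplyAsync(() -> runStrategy("velocity-rules"), pool));

        // Wait for all; RT ~ max(strategy RTs), the "longest board".
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        futures.forEach(f -> System.out.println(f.join()));
        pool.shutdown();
    }

    private static String runStrategy(String name) {
        return name + ": pass"; // placeholder for real rule evaluation
    }
}
```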

Asynchronous

Asynchronous versus synchronous is the question of whether to wait for the result or return immediately. When a synchronous path contains heavy internal I/O, the performance loss is huge, and making it asynchronous can greatly improve system throughput. There are trade-offs, though: asynchrony increases program complexity, and extra concerns such as failure compensation must be handled.

Such scenarios are everywhere in a risk control system (a fire-and-forget sketch follows the list):

  • MQ messages: message queues are asynchronous by nature. Thanks to the peak-shaving, valley-filling character of the consumption model, they are well suited to time-consuming actions the business does not need returned synchronously, such as offline decisioning.
  • Instrumentation, monitoring, and sampling: operations that are not on the business-critical request path should be processed asynchronously so they do not affect RT.
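
A fire-and-forget sketch for the instrumentation case; the event string and sink are illustrative:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Fire-and-forget telemetry: the business thread returns immediately and the
// event is recorded off the critical path.
public class AsyncTelemetry {
    private static final ExecutorService REPORTER =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "telemetry-reporter");
                t.setDaemon(true); // never block JVM shutdown
                return t;
            });

    public static void report(String event) {
        REPORTER.submit(() -> {
            // slow sink: log pipeline, metrics backend, MQ producer, etc.
            System.out.println("recorded: " + event);
        });
    }

    public static void main(String[] args) {
        report("decision=pass rt=37ms"); // returns instantly
    }
}
```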

Caching/Batch Merging

Data Caching

Caching exists to make things fast; that has been common knowledge since we first learned a programming language. Nearly every system that cares about performance uses caching to some degree, and traces of it run through our everyday tools, for example:

  • IoC (Inversion of Control): beyond dependency injection, the container caches singleton beans, saving the time of creating them repeatedly.
  • Active threads in a thread pool: pooling exists to gain speed. The pool itself is a cache-like container; frequently creating and destroying the pooled objects would not be worth the cost, so a fixed batch stays in the pool and is returned after use, largely avoiding the expense of creating new objects (see the sketch after this list).
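
A small demonstration of pooling-as-caching with a fixed thread pool:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// The pool keeps a fixed batch of already-created threads, so each task
// reuses one instead of paying thread-creation cost.
public class PoolReuse {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int i = 0; i < 6; i++) {
            int task = i;
            pool.submit(() ->
                    System.out.println("task " + task + " on " +
                            Thread.currentThread().getName())); // same 2 thread names repeat
        }
        pool.shutdown();
    }
}
```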

Batch Merging

Batch operations are generally used around I/O: fetch as much data as possible in one round trip to cut network time. (Note that this is not absolute; think about why pagination exists, and find the balance between time and space.)

Common batch operations (a Redis SCAN sketch follows the list):

  • Batched database queries: when handling multiple rows, use a single IN query instead of looping over single-row queries.
  • Scanning Redis data: use the SCAN command rather than issuing frequent GETs key by key (note: scanning too many keys per iteration can tie Redis up, so balance the batch size).
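
A sketch assuming the Jedis 4.x client (package paths differ in older versions); the key pattern and COUNT value are illustrative and should be tuned:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.ScanParams;
import redis.clients.jedis.resps.ScanResult;

// Iterating keys in batches with SCAN instead of per-key round trips.
public class RedisScanDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String cursor = ScanParams.SCAN_POINTER_START; // "0"
            ScanParams params = new ScanParams().match("risk:device:*").count(500);
            do {
                ScanResult<String> page = jedis.scan(cursor, params);
                page.getResult().forEach(System.out::println); // process one batch
                cursor = page.getCursor();
            } while (!"0".equals(cursor));                     // cursor "0" means done
        }
    }
}
```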

Summary

There are many ways to optimize performance, and approaching a problem from a different angle can sometimes work wonders. The premise, of course, is that the cost stays acceptable to the business; optimizing without regard for real-world constraints is not realistic.

The tuning strategies above were distilled and refined during day-to-day development. I hope they help readers establish a performance-optimization SOP and a shared set of troubleshooting techniques, so that you know which tool to reach for and when.

Welcome to follow the official account: Cuckoo Chicken Technology Column
Personal tech blog: https://jifuwei.github.io/
