Interface timeout analysis

Original: The interface suddenly timed out!!!

1. Network abnormality

1.1. Network jitter

Anyone who spends time online has run into this: a website usually loads quickly, but occasionally the page just keeps spinning and never finishes loading.

Your network may be jittering and dropping packets.

Packet loss can happen when the page sends a request to the API interface, or when the interface returns data to the page.

Network packet loss may cause the interface to time out.

1.2. Bandwidth is full

Sometimes, because a page or interface is poorly designed, a sudden increase in user requests can fully occupy the server's network bandwidth.

Server bandwidth is the amount of data that can be transmitted in a given period of time, for example 10M of data in 1 second.

If user requests suddenly surge and need 100M per second while the server bandwidth can only carry 10M per second, 90M of data is delayed within that second, and the interface times out.

2. The thread pool is full

For performance reasons, the API interface we call may use a thread pool to query data asynchronously, then aggregate the query results and return them.

Because the remote calls run in parallel, the total time spent calling the remote interfaces is that of the slowest call, 200ms in this example.

Before Java 8, you could get the result returned by a thread by implementing the Callable interface.

Since Java 8, this can be done with the CompletableFuture class. Here we take CompletableFuture as an example:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public UserInfo getUserInfo(Long id) throws InterruptedException, ExecutionException {
    final UserInfo userInfo = new UserInfo();

    // Each remote call runs on the custom thread pool and fills its part of userInfo.
    CompletableFuture<Boolean> userFuture = CompletableFuture.supplyAsync(() -> {
        getRemoteUserAndFill(id, userInfo);
        return Boolean.TRUE;
    }, executor);

    CompletableFuture<Boolean> bonusFuture = CompletableFuture.supplyAsync(() -> {
        getRemoteBonusAndFill(id, userInfo);
        return Boolean.TRUE;
    }, executor);

    CompletableFuture<Boolean> growthFuture = CompletableFuture.supplyAsync(() -> {
        getRemoteGrowthAndFill(id, userInfo);
        return Boolean.TRUE;
    }, executor);

    // Wait for all three calls to finish.
    CompletableFuture.allOf(userFuture, bonusFuture, growthFuture).join();

    // All three futures are already complete here, so get() returns immediately.
    userFuture.get();
    bonusFuture.get();
    growthFuture.get();

    return userInfo;
}

Here executor is a custom thread pool, used to avoid creating too many threads under high concurrency.

However, if there are too many user requests for the existing threads to handle, the thread pool puts the excess tasks in a queue to wait for an idle thread.

If many tasks are queued and an API request waits too long without being processed, the interface times out.

At this point, check whether the number of core threads is set too small, or whether multiple business scenarios share the same thread pool.

If the number of core threads is too small, increase it.

If the same thread pool is shared by multiple business scenarios, split it into multiple thread pools.
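
As a rough illustration of splitting one shared pool into per-scenario pools, here is a minimal sketch. The class name BusinessExecutors, the two scenarios (user-info aggregation and report export), the pool sizes, the queue capacities and the choice of CallerRunsPolicy are all assumptions made for the example, not values from the original service.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical holder for per-scenario thread pools (all names and sizes are assumptions).
final class BusinessExecutors {
    private BusinessExecutors() {}

    // Pool dedicated to the user-info aggregation interface.
    static final ThreadPoolExecutor USER_INFO_POOL = newPool(8, 16, 200);

    // Separate pool for a slower scenario, so its backlog cannot delay user-info requests.
    static final ThreadPoolExecutor REPORT_POOL = newPool(2, 4, 50);

    static ThreadPoolExecutor newPool(int coreThreads, int maxThreads, int queueCapacity) {
        return new ThreadPoolExecutor(
                coreThreads,
                maxThreads,
                60L, TimeUnit.SECONDS,                        // idle time before extra threads are released
                new ArrayBlockingQueue<>(queueCapacity),      // bounded queue: the backlog cannot grow without limit
                new ThreadPoolExecutor.CallerRunsPolicy());   // when saturated, the submitting thread runs the task
    }
}
```

With a bounded queue, a saturated pool pushes work back onto the caller instead of letting requests wait in the queue indefinitely. Suitable sizes depend on the workload and should be tuned from monitoring data.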

3. Database deadlock

Sometimes an interface timeout seems inexplicable, especially when database locks or a deadlock are involved.

Suppose the API interface you provide updates a row by its id, and at the same moment someone is manually running a SQL statement online to update data in batches.

That SQL statement runs in a transaction and happens to be updating the same row, so the two updates contend for the same lock, and a deadlock may even occur.

Because the batch SQL statement takes a long time to execute, the API interface's update is blocked by the database lock for a long time, no data can be returned, and the interface times out.

Nasty trap, isn't it?

Therefore, before performing batch operations on the database, be sure to evaluate how much data is affected, and do not update too much data at one time; otherwise many unexpected problems may follow.

In addition, batch updates are best run when there are few user visits, such as in the early morning.

4. Too many incoming parameters

Sometimes, an occasional interface timeout is caused by passing in too many parameters.

For example, take an interface that queries categories in batches by a set of ids. If the id set is small, say dozens or hundreds of ids, there is no performance problem. After all, id is the primary key of the category table, the primary key index can be used, and the database lookup is very fast.

However, if the caller passes in thousands or even tens of thousands of ids at once for a batch category query, the interface may time out.

Before executing a SQL statement, the database estimates its cost. With too many query conditions, it may decide that a full table scan is faster.

In that case, the SQL statement may not use the index, execution slows down, and the interface times out.

Therefore, when designing a batch interface, it is recommended to limit the size of the incoming collection, for example to 500.

If the collection exceeds the maximum size we set, the interface returns a failure directly and tells the user that too many parameters were passed in at one time.
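
The size check described above can be sketched as follows. The class and method names are hypothetical, and the limit of 500 is the example value from the text; a real interface would map the exception to whatever error response it normally returns.

```java
import java.util.List;

// Hypothetical guard for a batch query interface (names are assumptions).
final class BatchLimit {
    private BatchLimit() {}

    // Example limit taken from the text; pick a value that suits your own interface.
    static final int MAX_IDS = 500;

    // Rejects the request up front instead of letting an oversized query reach the database.
    static void checkSize(List<Long> ids) {
        if (ids == null || ids.isEmpty()) {
            throw new IllegalArgumentException("ids must not be empty");
        }
        if (ids.size() > MAX_IDS) {
            throw new IllegalArgumentException(
                    "too many parameters passed in at one time: " + ids.size() + " > " + MAX_IDS);
        }
    }
}
```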

This limit must be written into the interface documentation, so that callers do not fall into this trap when a call fails in production. Callers should be told about it during the interface development phase.

What if the interface caller really does need to pass in a large number of parameters?

Answer: the requirement may be unreasonable, or there may be a problem with the system design. We should try our best to avoid this in the system design stage.

If redesigning the system would require major changes, there is a temporary workaround: the caller splits the parameters into batches, calls the interface for each batch with multiple threads, and finally merges the results.
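
A minimal sketch of this caller-side workaround, under the assumption that the remote call can be expressed as a function from a batch of ids to a list of results. BatchCaller, callInBatches and the remoteCall parameter are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;

// Hypothetical caller-side helper (names are assumptions).
final class BatchCaller {
    private BatchCaller() {}

    // Splits ids into batches of at most batchSize, calls the remote interface for each
    // batch on the given executor, and merges the results in the original batch order.
    static <T, R> List<R> callInBatches(List<T> ids,
                                        int batchSize,
                                        Function<List<T>, List<R>> remoteCall,
                                        ExecutorService executor) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<CompletableFuture<List<R>>> futures = new ArrayList<>();
        for (int from = 0; from < ids.size(); from += batchSize) {
            List<T> batch = ids.subList(from, Math.min(from + batchSize, ids.size()));
            futures.add(CompletableFuture.supplyAsync(() -> remoteCall.apply(batch), executor));
        }
        List<R> merged = new ArrayList<>();
        for (CompletableFuture<List<R>> future : futures) {
            merged.addAll(future.join()); // rethrows as CompletionException if a batch failed
        }
        return merged;
    }
}
```

The batch size should stay at or below the limit documented by the interface provider, and the executor should be bounded so that the caller does not simply move the overload to the remote service.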

5. The timeout setting is too short

Normally, when calling a remote API interface, it is recommended to set the connection timeout and read timeout parameters, and to make both dynamically configurable.

The benefit is that when the remote API interface has performance problems and responds very slowly, it does not drag down our own service.

For example, suppose the remote API interface you call takes 100 seconds to return data, and the timeout you set is also 100 seconds. If 1000 requests arrive for this API interface, the Tomcat thread pool quickly fills up and the whole service becomes temporarily unavailable, or at least new requests cannot be answered immediately.

So we need to set timeouts, and they must not be too long.

For business scenarios with low concurrency, these two timeouts can be set a little longer, for example a connection timeout of 10 seconds and a read timeout of 20 seconds.

For business scenarios with high concurrency, they can be set at the second or millisecond level.

For convenience, some developers share the same two timeouts across all business scenarios.

Then one day, for a high-concurrency business scenario, you shorten the timeout.

As a direct result, API calls in the low-concurrency business scenarios start to time out.

Therefore, it is not recommended that multiple business scenarios share the same timeout settings. It is best to set timeouts separately according to each scenario's concurrency.
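
One way to keep the timeouts separate per scenario is sketched below with the JDK's HttpURLConnection. The enum name and the high-concurrency values (500 ms and 2 s) are assumptions for illustration; the low-concurrency values are the 10 s and 20 s mentioned above. In a real service these numbers would come from dynamic configuration rather than constants.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical per-scenario timeout profiles (names and high-concurrency values are assumptions).
enum TimeoutProfile {
    LOW_CONCURRENCY(10_000, 20_000),  // 10 s connect, 20 s read, as in the text
    HIGH_CONCURRENCY(500, 2_000);     // assumed values at the millisecond/second level

    final int connectTimeoutMillis;
    final int readTimeoutMillis;

    TimeoutProfile(int connectTimeoutMillis, int readTimeoutMillis) {
        this.connectTimeoutMillis = connectTimeoutMillis;
        this.readTimeoutMillis = readTimeoutMillis;
    }

    // Creates a connection object with this profile's timeouts applied.
    // No network traffic happens until the caller actually connects or reads.
    HttpURLConnection open(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(connectTimeoutMillis);
            conn.setReadTimeout(readTimeoutMillis);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because each scenario picks its own profile, shortening HIGH_CONCURRENCY does not affect calls that use LOW_CONCURRENCY.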

6. Too much data is returned at one time

You may have come across a requirement like this: a job calls a third-party API query interface on a schedule every day to fetch the data updated yesterday, then writes it to our own database table.

Since the third party does not update much data each day, this API interface usually responds fairly quickly.

But one day, the API interface suddenly times out.

The logs show that the API interface returned too much data in one call, and that all of the records have the same update time.

We can conclude that the API provider ran a batch update that modified a large amount of data, which caused the problem.

Even if the job has a retry mechanism, the retry is likely to time out as well, because the API still returns too much data at once. As a result, the latest data from the previous day cannot be obtained from the third party.

Therefore, a third-party interface that queries incremental data by date should support paging. Otherwise, the next batch update may cause the interface to time out.
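
If the third-party interface does offer paging, the job can pull the data page by page. The sketch below assumes a hypothetical PageFetcher interface with a 1-based page number and a page size; the actual parameters depend on the third party's API. The maxPages cap is an extra safety assumption so that a misbehaving interface cannot keep the job looping forever.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical paged synchronisation helper (interface and names are assumptions).
final class PagedSync {
    private PagedSync() {}

    // Stand-in for the third-party client: returns one page of records updated on the given date.
    interface PageFetcher<T> {
        List<T> fetch(LocalDate updatedOn, int pageNo, int pageSize);
    }

    // Fetches pages until a page comes back shorter than pageSize, which marks the last page.
    static <T> List<T> fetchAll(PageFetcher<T> fetcher, LocalDate updatedOn, int pageSize, int maxPages) {
        List<T> all = new ArrayList<>();
        for (int pageNo = 1; pageNo <= maxPages; pageNo++) {
            List<T> page = fetcher.fetch(updatedOn, pageNo, pageSize);
            all.addAll(page);
            if (page.size() < pageSize) {
                return all;
            }
        }
        throw new IllegalStateException("more than " + maxPages + " pages returned; aborting");
    }
}
```

Each request now returns at most pageSize records, so a large batch update on the provider's side leads to more requests rather than one oversized response.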

7. Infinite loop

Can an infinite loop also cause the interface to time out?

Shouldn't an infinite loop be caught in the testing phase? Why would it only show up in production?

Indeed, most infinite loop problems are found during testing.

But some infinite loops are hidden more deeply, as in the cases below.

There are actually two types of infinite loop:

  • ordinary infinite loop
  • infinite recursion

7.1. Ordinary infinite loop

Sometimes the infinite loop is written by ourselves, such as the following code:

while (true) {
    if (condition) {
        break;
    }
    System.out.println("do something");
}

This uses a while(true) loop, which is often seen in CAS spin locks.

When condition is true, the loop exits.

If the condition is very complicated, then once the check is wrong or some logical branch is missing, an infinite loop may occur in certain scenarios.

This kind of infinite loop is most likely a bug introduced by the developer, but it is relatively easy to detect.

There is also a more deeply hidden kind of infinite loop, caused by code that is not written rigorously enough. With normal data the problem may never show up, but as soon as abnormal data appears, the loop never ends.
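
One defensive option, shown here only as a sketch, is to cap the number of iterations so that an unexpected condition fails fast instead of spinning forever. BoundedLoop, runUntil and maxIterations are hypothetical names, and the right cap depends on the business logic.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper that bounds a loop which would otherwise be while(true).
final class BoundedLoop {
    private BoundedLoop() {}

    // Checks the condition up to maxIterations times and returns the 1-based iteration
    // on which it became true; throws if it never does.
    static int runUntil(BooleanSupplier condition, int maxIterations) {
        for (int i = 1; i <= maxIterations; i++) {
            if (condition.getAsBoolean()) {
                return i;
            }
            // the real work of one iteration would go here
        }
        throw new IllegalStateException("condition not met after " + maxIterations + " iterations");
    }
}
```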

7.2. Infinite recursion

If you want to print all parent categories of a certain category, you can use a recursive method like this:

public void printCategory(Category category) {
    // Stop at a missing category or at the root, which has no parent.
    if (category == null || category.getParentId() == null) {
        return;
    }
    System.out.println("Parent category name: " + category.getName());
    Category parent = categoryMapper.getCategoryById(category.getParentId());
    printCategory(parent);
}

Under normal circumstances, this code works fine.

But if someone mistakenly sets a category's parentId to its own id, infinite recursion occurs. The interface never returns data, and eventually a stack overflow occurs.

It is recommended to set a maximum recursion depth when writing a recursive method. For example, if categories have at most 4 levels, the depth can be set to 4. The recursive method then checks the depth and returns as soon as it exceeds 4, which avoids infinite recursion.
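
A minimal sketch of the depth check, under two assumptions made so that the example is self-contained: the categories are looked up in an in-memory Map rather than through categoryMapper, and the names are collected into a list instead of being printed. CategoryTree and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical depth-limited version of the recursive lookup (names are assumptions).
final class CategoryTree {
    private CategoryTree() {}

    // Maximum category level, taken from the example in the text.
    static final int MAX_DEPTH = 4;

    static final class Category {
        final Long id;
        final Long parentId;
        final String name;

        Category(Long id, Long parentId, String name) {
            this.id = id;
            this.parentId = parentId;
            this.name = name;
        }
    }

    // Walks up the parent chain, collecting names, and never goes deeper than MAX_DEPTH levels.
    static List<String> collectNames(Category category, Map<Long, Category> byId) {
        List<String> names = new ArrayList<>();
        collect(category, byId, 1, names);
        return names;
    }

    private static void collect(Category category, Map<Long, Category> byId, int depth, List<String> names) {
        if (category == null || category.parentId == null || depth > MAX_DEPTH) {
            return; // the depth check stops the recursion even if the data contains a cycle
        }
        names.add(category.name);
        collect(byId.get(category.parentId), byId, depth + 1, names);
    }
}
```

With a category whose parentId points to itself, the method returns after 4 levels instead of overflowing the stack. Logging or raising an alert when the limit is hit would also help surface the bad data.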

8. The SQL statement does not use the index

Have you ever run into this: the SQL is clearly the same and only the input parameters differ, yet sometimes index a is used and sometimes index b?

That's right: sometimes MySQL chooses the wrong index, and sometimes it does not use an index at all.

Before executing a SQL statement, MySQL estimates the number of rows to scan through sampled statistics, then decides which index to use based on the estimated row count, selectivity, cardinality, data pages and other information.

Sometimes parameter 1 is passed in, the SQL statement uses index a, and execution is very fast. But sometimes parameter 2 is passed in, the SQL statement uses index b, and execution is noticeably slower.

This may cause the API interface to time out.

If necessary, FORCE INDEX can be used to force the query to use a specific index.
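
As a sketch of what that looks like from Java, the query below places a MySQL FORCE INDEX hint after the table name. The table orders, its columns and the index idx_user_id are hypothetical and only serve the example.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical query holder (table, columns and index name are assumptions).
final class ForceIndexQuery {
    private ForceIndexQuery() {}

    // MySQL index hint: FORCE INDEX (index_name) goes right after the table name.
    static final String SQL =
            "SELECT id, status FROM orders FORCE INDEX (idx_user_id) "
          + "WHERE user_id = ? AND status = ?";

    // Binds the parameters; the caller executes and closes the statement.
    static PreparedStatement prepare(Connection conn, long userId, int status) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(SQL);
        ps.setLong(1, userId);
        ps.setInt(2, status);
        return ps;
    }
}
```

A forced index stays forced even if the data distribution later changes, so it is worth confirming with EXPLAIN that the hint really helps before relying on it.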

9. Service OOM

I once ran into this scenario: an interface that queries a category by id, where id is the primary key and the SQL statement can use the primary key index, still timed out.

It seemed hard to believe at the time, because this interface took only a dozen or so milliseconds on average. How could it time out?

Yet the logs from that time showed an interface response time of 5 seconds, so the timeout was real.

Eventually, the Prometheus memory monitoring for the service revealed an OOM problem.

In fact, the service hosting the API interface had hung for a period of time because of an OOM (out of memory) error.

During that period, requests to all interfaces timed out.

However, because the K8S cluster monitors the service, it automatically killed the hung service node and deployed a new one in a container. Fortunately, users were not much affected.

If you are interested in OOM issues, you can read my other article "Oops, OOM Occurs in Online Services".

10. In debug mode

Sometimes we use a local development tool, for example IDEA, to connect directly to the test environment's database and debug the business logic of an API interface.

This is because some problems are hard to reproduce in the development environment.

To track down a bug, you turn on debug mode when requesting a local interface and step through the code line by line.

You stop on a certain line for a long time. That line updates a certain row of data.

Meanwhile, a tester updates the same row from the relevant business page.

This can also cause database lock waits or deadlocks.

Since the transaction is never committed while you are paused in IDEA's debug mode, the row stays locked for a long time, and the API interface requested by the business page times out.

Of course, if you are interested in general interface performance issues, you can read my other article "Talking about 11 Tips for Interface Performance Optimization", which covers them in detail.

Reposted from: blog.csdn.net/weixin_41544662/article/details/128722915