Problems and solutions encountered when building stable services

Table of Contents

1. Set network timeouts reasonably

2. Separate core and non-core business

3. Configure the number of Tomcat threads reasonably

4. Avoid retrying in code

5. Treat non-essential dependencies as weak dependencies

6. Keep database transactions lean

7. SQL performance optimization points

8. Keep releases as smooth as possible

9. Handle abnormal traffic with rate limiting, circuit breaking, degradation, and queuing

10. Improve logging and monitoring


This article summarizes lessons learned from building highly available, high-performance services.

1. Set network timeouts reasonably

1.1 What are the timeouts of a network call?

Network requests between application servers, or from an application server to a Redis or MQ server, generally have three timeouts:

  • connectRequestTimeout: how long the client waits to obtain a connection from the connection pool.
  • connectTimeout: how long the client waits to establish a connection to the server.
  • socketTimeout: how long the client waits to read data from the server on an established connection.

1.2 Why set timeouts?

Connection pool and thread pool resources are limited. Without timeouts, a slow or failing downstream service leaves a large number of threads blocked, waiting for it to return.

Normal requests then queue up or get rejected, responses slow down, throughput (QPS) drops, and the user experience suffers. Setting timeouts avoids this.

1.3 How to set timeouts reasonably?

A simple rule: keep each of the three timeouts (socketTimeout, connectTimeout, connectRequestTimeout) at no more than 300ms, and as short as the system can tolerate.
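
As a minimal sketch of this rule, the JDK's built-in `java.net.http.HttpClient` (Java 11+) can be given explicit timeouts. This is only an illustration, not the configuration used by the author: the JDK client exposes a connect timeout and a per-request response timeout (used here as a stand-in for socketTimeout), and it has no setting for the pool wait (connectRequestTimeout). The class name `TimeoutConfig`, the `BUDGET` constant, and the URL are assumptions made for the example.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class TimeoutConfig {
    // Assumed budget, taken from the 300ms rule of thumb above.
    static final Duration BUDGET = Duration.ofMillis(300);

    // Client that gives up if the connection is not established in time.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(BUDGET)
                .build();
    }

    // Request that fails with HttpTimeoutException if no response arrives in time.
    static HttpRequest newRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(BUDGET)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpClient client = newClient();
        HttpRequest request = newRequest("http://localhost:8080/health"); // hypothetical endpoint
        System.out.println(client.connectTimeout().orElseThrow().toMillis());
        System.out.println(request.timeout().orElseThrow().toMillis());
    }
}
```

Clients that manage their own connection pool usually expose the pool wait as a separate setting, which should get the same kind of budget.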

The timeout can be derived from the system's 99th percentile (P99) response time. P99 is the smallest response time within which 99% of network requests complete. For example, if an interface receives 10,000 requests a day, P99 is the smallest time within which 9,900 of those requests complete. For the calculation details, see this article ( https://blog.csdn.net/brucewong0516/article/details/80205422 ).
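
The P99 definition above can be sketched with the nearest-rank method: sort the samples and take the one at rank ceil(n × 99 / 100). The class name `Percentile` and the synthetic latencies are assumptions for illustration; monitoring systems often use interpolated or approximate percentiles instead.

```java
import java.util.Arrays;

public class Percentile {
    // Nearest-rank percentile: the smallest sample such that at least p% of
    // all samples are less than or equal to it.
    static long percentile(long[] latenciesMs, int p) {
        if (latenciesMs.length == 0 || p < 1 || p > 100) {
            throw new IllegalArgumentException("need samples and 1 <= p <= 100");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        // rank = ceil(n * p / 100), computed in integer arithmetic
        int rank = (int) (((long) sorted.length * p + 99) / 100);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        long[] samples = new long[10_000];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = i + 1; // synthetic latencies: 1ms .. 10000ms
        }
        System.out.println(percentile(samples, 99)); // the 9900th smallest sample
    }
}
```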

Redis reads and writes normally take 2-3ms, so the Redis timeout should be shorter still; try not to exceed 50ms. The same applies to MQ timeouts.

2. Separate core and non-core business

Every company has core and non-core business. Map out the core link (call chain) for the core business; the core link is the path that serves the company's most valuable business. On that link, core services may only call other core services.

Non-core services may only call non-core services. If the company can afford it, deploy the core business across two or more data centers.

3. Configure the number of Tomcat threads reasonably

Size the thread pool to the workload: CPU-intensive services need fewer threads, and IO-intensive services need more. For details, see this article ( https://blog.csdn.net/jack1liu/article/details/100511226 ).
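
One common heuristic for a starting value (not taken from the article above, so treat it as an assumption) is threads ≈ cores × (1 + wait time / compute time). The sketch below applies it; the class and method names are made up for illustration, and the result is only a first guess to refine under load testing before setting Tomcat's `maxThreads`.

```java
public class ThreadPoolSizing {
    // Heuristic: threads = cores * (1 + waitMs / computeMs).
    // A starting point for load testing, not a substitute for it.
    static int suggestedThreads(int cores, long waitMs, long computeMs) {
        if (cores < 1 || waitMs < 0 || computeMs < 1) {
            throw new IllegalArgumentException("invalid sizing input");
        }
        return (int) (cores * (waitMs + computeMs) / computeMs);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // CPU-bound: almost no waiting, so roughly one thread per core.
        System.out.println(suggestedThreads(cores, 0, 10));
        // IO-bound: 90ms waiting per 10ms of computation, so about 10x the cores.
        System.out.println(suggestedThreads(cores, 90, 10));
    }
}
```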

4. Avoid retrying in code

Unless there is a special reason, do not retry in code. Retries should be business-level retries wherever possible: the upstream business side decides whether to repeat the operation.

Why not retry in code?

Retrying in code easily amplifies traffic. If the downstream service normally sees 1x traffic and every request is attempted up to 5 times while it is failing, it can see up to 5x its usual traffic. That can easily bring the service down.
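
The amplification is simple arithmetic, and it compounds when more than one layer on the call path retries. The sketch below is a worst-case estimate under the assumption that every attempt fails and every layer uses the same attempt count; `RetryAmplification` and its parameters are names invented for the example.

```java
public class RetryAmplification {
    // Worst-case calls reaching the last service when every layer on the call
    // path makes up to attemptsPerLayer attempts per incoming request.
    static long worstCaseCalls(long requests, int attemptsPerLayer, int layers) {
        long calls = requests;
        for (int i = 0; i < layers; i++) {
            calls = Math.multiplyExact(calls, (long) attemptsPerLayer);
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseCalls(1_000, 5, 1)); // one retrying layer: 5000
        System.out.println(worstCaseCalls(1_000, 5, 3)); // three retrying layers: 125000
    }
}
```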

5. Treat non-essential dependencies as weak dependencies

What is a weak dependency?

A weak dependency is a dependency on a process that has little impact on the main flow, so its failure should not fail the main flow.

For example, when a timeout exception from mq/redis does not affect the main function, catch the exception and do not throw it to the upper layer. For example:

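String value = null;
try {
	value = redis.get("key");
} catch (Exception e) {
	// Redis is a weak dependency: swallow the failure and fall back to the database
}
if (value == null) {
	value = dao.getOneColumn("");
}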

Without a catch that treats Redis as a weak dependency, a Redis failure throws an exception straight to the upper layer, and the data is never read from the database.

Besides catching exceptions for MQ fault tolerance, how should messages that would otherwise be lost be handled while MQ is unavailable? One option is to write the message to a log instead of MQ and process it later.
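
A minimal sketch of that MQ fallback follows. The `MessageQueue` interface is hypothetical and stands in for whatever MQ client is in use, and the in-memory queue stands in for a durable log (a file or a table) that a separate job would replay; both are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class WeakMqSender {
    // Hypothetical MQ client interface; a real client would be wrapped here.
    interface MessageQueue {
        void send(String message) throws Exception;
    }

    private final MessageQueue mq;
    // Stand-in for a durable local log that a job replays later.
    private final Queue<String> fallbackLog = new ConcurrentLinkedQueue<>();

    WeakMqSender(MessageQueue mq) {
        this.mq = mq;
    }

    // Returns true if the message reached MQ, false if it was only logged.
    boolean send(String message) {
        try {
            mq.send(message);
            return true;
        } catch (Exception e) {
            // MQ is a weak dependency: record the message and keep the main flow alive.
            fallbackLog.add(message);
            return false;
        }
    }

    // Snapshot of the messages still waiting to be replayed.
    List<String> pendingMessages() {
        return new ArrayList<>(fallbackLog);
    }

    public static void main(String[] args) {
        WeakMqSender sender = new WeakMqSender(msg -> {
            throw new java.io.IOException("mq unavailable"); // simulated outage
        });
        System.out.println(sender.send("order-created:42")); // false
        System.out.println(sender.pendingMessages());        // [order-created:42]
    }
}
```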

6. Keep database transactions lean

Keep the operations inside a transaction to a minimum to shorten its execution time, and never make RPC calls inside a transaction.
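
One way to follow this rule is to finish the remote call first and open the transaction only for the local writes. The sketch below simulates that ordering with an event list; `fetchPriceViaRpc` and `inTransaction` are hypothetical helpers standing in for a real RPC client and a real transaction manager.

```java
import java.util.ArrayList;
import java.util.List;

public class LeanTransaction {
    // Records what happens in which order, standing in for real infrastructure.
    static final List<String> events = new ArrayList<>();

    // Hypothetical remote call, e.g. fetching a price from another service.
    static int fetchPriceViaRpc(String sku) {
        events.add("rpc");
        return 100;
    }

    // Hypothetical transaction helper: begin, run the work, commit or roll back.
    static void inTransaction(Runnable work) {
        events.add("begin");
        try {
            work.run();
            events.add("commit");
        } catch (RuntimeException e) {
            events.add("rollback");
            throw e;
        }
    }

    static void placeOrder(String sku) {
        // 1. Do the slow, failure-prone remote call before the transaction starts.
        int price = fetchPriceViaRpc(sku);
        // 2. Keep only the local write inside the transaction.
        inTransaction(() -> events.add("insert order " + sku + " @" + price));
    }

    public static void main(String[] args) {
        placeOrder("sku-1");
        System.out.println(events); // [rpc, begin, insert order sku-1 @100, commit]
    }
}
```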

7. SQL performance optimization points

7.1 How to define slow SQL?

In theory, user-facing SQL should execute within 10ms; anything over 50ms can be classified as slow SQL.

7.2 What is an appropriate limit on the number of rows per query?

For example, LIMIT should not exceed 100 or 200, and the number of ids in an IN clause should likewise be capped at 100 or 200.
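
When the caller holds more ids than the cap allows, the list can be split into batches and queried one batch at a time. The sketch below shows only the splitting; the class name `IdBatcher` and the cap of 200 are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class IdBatcher {
    // Splits ids into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> ids, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < ids.size(); from += batchSize) {
            int to = Math.min(from + batchSize, ids.size());
            batches.add(new ArrayList<>(ids.subList(from, to)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long id = 1; id <= 450; id++) {
            ids.add(id);
        }
        // Each batch would feed one "WHERE id IN (...)" query.
        for (List<Long> batch : partition(ids, 200)) {
            System.out.println(batch.size()); // 200, 200, 50
        }
    }
}
```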

7.3 How to check that newly added SQL is problem-free?

Use EXPLAIN to view the execution plan of the SQL.

Add indexes on the queried fields to speed up the query.

8. Keep releases as smooth as possible

For a database migration, for example, the program design and the release process must take two core points into account: limiting the scope of impact as much as possible, and recovering quickly.

How to limit the scope of impact as much as possible?

  • Use a grayscale release and ramp up the traffic gradually.
  • Add a whitelist (or similar) so the feature can be verified first.

How to restore service quickly?

  • Add dynamic switches so that functionality can be restored quickly.
  • Prepare the rollback plan in advance.
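
The grayscale ramp, the whitelist, and the dynamic switch can be combined in one small gate, sketched below. `ReleaseGate` and its methods are hypothetical names; in practice the switch and the percentage would usually be read from a configuration center rather than set in code.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ReleaseGate {
    // Dynamic switch: flipping it to false turns the new path off immediately.
    private volatile boolean enabled = true;
    // Grayscale percentage in [0, 100]; raise it step by step during rollout.
    private volatile int rolloutPercent = 0;
    // Whitelisted users always get the new path while the switch is on.
    private final Set<Long> whitelist = ConcurrentHashMap.newKeySet();

    void setEnabled(boolean enabled) { this.enabled = enabled; }

    void setRolloutPercent(int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be in [0, 100]");
        }
        this.rolloutPercent = percent;
    }

    void allow(long userId) { whitelist.add(userId); }

    // Decides whether this user should be served by the new code path.
    boolean useNewPath(long userId) {
        if (!enabled) {
            return false;
        }
        if (whitelist.contains(userId)) {
            return true;
        }
        // Stable bucket 0..99 per user, so a user does not flip between paths.
        return Math.floorMod(userId, 100L) < rolloutPercent;
    }

    public static void main(String[] args) {
        ReleaseGate gate = new ReleaseGate();
        gate.allow(7L);
        System.out.println(gate.useNewPath(7L));   // true: whitelisted
        System.out.println(gate.useNewPath(150L)); // false: 0% rollout
        gate.setRolloutPercent(60);
        System.out.println(gate.useNewPath(150L)); // true: bucket 50 < 60
        gate.setEnabled(false);
        System.out.println(gate.useNewPath(7L));   // false: switch is off
    }
}
```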

9. Handle abnormal traffic with rate limiting, circuit breaking, degradation, and queuing

The author has run into a service becoming temporarily unavailable because of abnormal traffic. For details, see ( https://blog.csdn.net/jack1liu/article/details/112135898 ).

Besides rate limiting, circuit breaking, degradation, and queuing can also be used against abnormal traffic. Any of them is fine as long as it solves the problem.
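
As one possible illustration of rate limiting combined with degradation, the sketch below uses a fixed-window counter and returns a fallback response once the window is full. It is an assumption-laden toy (class name, limits, and responses are invented, and fixed windows allow bursts at window boundaries); production systems typically rely on a dedicated library or gateway.

```java
public class FixedWindowLimiter {
    private final int limit;
    private final long windowMillis;
    private long windowStart;
    private int count;

    FixedWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    // Returns true if a request arriving at nowMillis fits in the current window.
    synchronized boolean tryAcquire(long nowMillis) {
        if (nowMillis - windowStart >= windowMillis) {
            windowStart = nowMillis; // start a new window
            count = 0;
        }
        if (count < limit) {
            count++;
            return true;
        }
        return false;
    }

    // Serves the request, or degrades to a cheap fallback when over the limit.
    String handle(long nowMillis) {
        return tryAcquire(nowMillis) ? "real response" : "degraded response";
    }

    public static void main(String[] args) {
        FixedWindowLimiter limiter = new FixedWindowLimiter(2, 1_000);
        System.out.println(limiter.handle(0));     // real response
        System.out.println(limiter.handle(1));     // real response
        System.out.println(limiter.handle(2));     // degraded response
        System.out.println(limiter.handle(1_000)); // real response (new window)
    }
}
```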

10. Improve logging and monitoring

We should log the data that is reasonable and necessary.

10.1 What is a reasonable and necessary log?

Logs that help troubleshoot business problems and system problems are reasonable and necessary.

10.2 Why do we need to improve the monitoring function?

Without monitoring, our services are effectively running naked. When something goes wrong, monitoring helps us find, reproduce, and fix the problem faster and more effectively.


Origin blog.csdn.net/jack1liu/article/details/112647026