June 25-26 (last weekend) enterprise downtime problem analysis

 

@Operation and maintenance group
Before deciding to restart your way out of a performance or other problem, do one thing first: pick one or two of the applications that are about to be restarted, run kill -3 <pid> to dump their threads (preferably three times), and only then restart.
The thread dump lets developers see the threads in the JVM at the moment of the failure: how many there are, what each one is doing, what state it is in, what it is waiting on, and whether there is a deadlock.
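For reference, the same categories of information can also be pulled from inside the JVM through the JDK's ThreadMXBean. The sketch below is illustrative only (not part of the enterprise code); it prints each live thread's state and lock information and checks for deadlocks, which is essentially what the kill -3 dump (written to the JVM's stdout, e.g. catalina.out for Tomcat) contains:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class ThreadDumpSketch {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            // One snapshot of every live thread: name, state, and what it is blocked/waiting on
            for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
                System.out.print(info);   // ThreadInfo.toString() prints state, locks and a partial stack
            }
            // Deadlock check: threads stuck waiting on each other's monitors or locks
            long[] deadlocked = mx.findDeadlockedThreads();
            System.out.println("deadlocked threads: "
                    + (deadlocked == null ? 0 : deadlocked.length));
        }
    }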
 
=================================
June 25-26 (last weekend) enterprise downtime problem analysis:
 
Process description: 1) On Saturday afternoon, the enterprise system responded slowly and the seller's homepage went down. Besides enterprise, bcwap, the Qianbao app, the Qianbao api, and the task database were all raising alarms.
                      After operations rolled back the Qianbao api, every system except enterprise recovered. Restarting enterprise repeatedly did not fix it, so the enterprise release was rolled back urgently, the CAT monitoring feature that had gone live on Friday was removed, and enterprise came back up.

                    2) The same problem reappeared at noon on Sunday: enterprise again responded slowly, the error rate was high, and the seller's homepage went down.
                      This time, besides enterprise, bcwap was under heavy pressure and dragged down www.qbao; after bcwap was blocked, www.qbao returned to normal. Restarting enterprise only helped briefly: within about 2 minutes memory was exhausted again, even after the heap was raised to 12G,
                      and the system still did not recover. In the end machines were added quickly; after the online enterprise app was expanded to 12 instances, the problem went away and the system recovered.
 
Analysis and conclusions:
1) On Saturday the enterprise release was rolled back, the CAT monitoring feature was removed, and the system recovered. A problem in the CAT monitoring feature cannot be ruled out, but this is only a suspicion: the incident may have nothing to do with CAT, and the systems the enterprise app depends on may simply have happened to recover right after CAT was removed,
    which would lead to the misjudgment that "CAT is the problem". Given that the users here are all B-side (business) users and the seller's homepage carries a large number of functional entry points, future monitoring rollouts should start with other, less critical systems and be advanced gradually.
 
2) The direct cause of the enterprise outage is that business threads got stuck: a large number of business threads sat in a waiting state, and in the worst case the thread pool was exhausted, the request queue filled up, and the service started refusing requests outright.
    During the enterprise failure window, www.qbao was under heavy pressure and responding slowly. On front-end pages such as the seller's homepage, enterprise periodically calls www.qbao (there are scheduled tasks on the front end) to fetch the user's announcement information. This remote call goes over HTTP, and SO_TIMEOUT is not set, so when www.qbao gives no response the business thread keeps waiting for one ("keeps waiting" is not strictly accurate, since the behaviour depends on the httpClient version, but for enterprise waits of a thousand seconds or more were possible). The browser never gets a response, and the business thread is not released. Meanwhile the front end keeps sending new requests; because the original business threads are still waiting, the backend keeps allocating new threads, those threads wait too, and the thread count climbs sharply (roughly, the number of blocked threads grows as the request rate multiplied by the per-request wait time). With most backend threads stuck waiting on www.qbao data, all other requests to enterprise start to fail as well, which disrupts access to the whole site. A sketch of the missing-timeout behaviour follows point b) below.

 

  a) When the failure occurred, the /sellerCenter/queryAnnounceAndIMInfo interface, viewed through Tingyun, was taking anywhere from tens to hundreds of seconds, and many pages such as the seller's homepage poll for the announcement information on a timer. The interface is called frequently and responds slowly; combined with how Tomcat works
       (one worker thread is held for the full duration of each request), this interface alone ties up a large number of backend threads.

  Even now this interface is still very slow. At the time of the crash, this URL accounted for more than 70% of the slow calls shown in Tingyun.
b) Because the threads were not dumped in time when the failure occurred, and there is no way to inspect the working state of the threads at that moment through Tingyun, the conclusions above are reached by static reasoning; but combined with how Tomcat works, they should be credible.
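The following is a minimal sketch of the difference SO_TIMEOUT makes (not the actual enterprise code; the class name, variable names and URL are illustrative), written against the same 4.x httpClient API that appears in the solution section below. With no SO_TIMEOUT, the execute() call can hold the Tomcat worker thread for as long as www.qbao stays silent; a bounded SO_TIMEOUT turns the hang into a SocketTimeoutException after a few seconds and releases the thread:

    import java.net.SocketTimeoutException;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.params.BasicHttpParams;
    import org.apache.http.params.HttpConnectionParams;
    import org.apache.http.params.HttpParams;
    import org.apache.http.util.EntityUtils;

    public class AnnouncementCallSketch {
        public static void main(String[] args) throws Exception {
            HttpParams params = new BasicHttpParams();
            // If these two lines are omitted, both timeouts default to 0 = wait indefinitely,
            // and the Tomcat worker thread serving /sellerCenter/queryAnnounceAndIMInfo stays blocked.
            HttpConnectionParams.setConnectionTimeout(params, 3000); // bound connection setup
            HttpConnectionParams.setSoTimeout(params, 5000);         // bound the wait for the response
            HttpClient client = new DefaultHttpClient(params);
            try {
                HttpResponse resp = client.execute(new HttpGet("http://www.qbao.com/announce")); // URL illustrative
                EntityUtils.consume(resp.getEntity());
            } catch (SocketTimeoutException e) {
                // The provider is slow or down: fail fast (or fall back to cached data)
                // instead of pinning the worker thread.
            }
        }
    }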
 
3) Why did OOM occur?
    OOM appears when, even after a full GC, there is still not enough memory to satisfy an allocation.
    The OOM that operations saw may genuinely mean the traffic was too high for the application to carry, but the fact that it still could not cope after the heap was raised to 12G is strange.
 
4) Adding machines is one way to solve the problem
   More machines mean more resources for handling requests, which clearly helps in business scenarios that scale horizontally.
   However, in the past only 5 enterprise app instances served the online traffic; now there are 12, and the CPU and memory utilization of each machine is very low, which amounts to wasted machines.
   Machine waste means operations can co-locate other applications on the enterprise app machines; it does not mean the number of machines must be reduced.
   Reducing the number of machines would raise CPU utilization, but it would also increase GC pressure. Given how important the enterprise system is, capacity design has to consider both normal and extreme scenarios, so adjusting the deployment just to save machines is not recommended.
 
=============================
Solution:
1) Timeouts must be controlled, so that a slow service provider cannot make its callers wait for a long time.
    --- Timeout settings should distinguish between business scenarios: some interfaces need a long timeout, others a short one; with dubbo it is easy to control this at the method level.
         For httpClient, wrap it in a utils class and use a generally acceptable default timeout.
 
2) System optimization
   a) Only RPC is discussed here: for data that is not very time-sensitive, the result of the call can be cached on the caller side with an expiration time, reducing the number of calls to external systems (a minimal cache sketch appears after this list of optimizations).
      If the data really is time-sensitive, it has to be fetched in real time, and in that case the calling side has to make sure performance holds up.
  b) Optimizations to the various enterprise configuration parameters; they should also be a useful reference for other groups.
       The point of tuning these parameters is to strike a reasonable balance between the number of threads in the business thread pool and the number of contended resources (database connections, HTTP connections).
         1) server.xml configuration of the online machines
            maxThreads="8000" has been lowered to 2048; the thread count observed online under normal conditions is below 200.
        2) Database connection pool configuration
            The maximum number of connections the online database is currently configured to support is 5000.
            In jdbc.properties, maxConnection was 200 per instance; with 12 instances online, the app's maxConnection has been doubled to 400 (12 x 400 = 4800, still under the database's 5000-connection limit).
            The DBA checked the database connections: there were only 46 active connections yesterday afternoon, so the database is under very little pressure.
        3) Standardize the use of httpClient
           For internal calls it is not recommended to open a new connection for every request and release it afterwards; the HTTP connection pool should be reused. An httpClient is a relatively heavy object and should be shared; otherwise, under heavy traffic, connections are opened and closed constantly, adding unnecessary overhead.
          Key parameters (shown with illustrative declarations; the exact client classes can differ by httpClient version):
            HttpParams httpParams = new BasicHttpParams();
            HttpConnectionParams.setConnectionTimeout(httpParams, 15000); // connect timeout: 15s
            HttpConnectionParams.setSoTimeout(httpParams, 60000);         // SO_TIMEOUT (read timeout): 60s
            HttpConnectionParams.setTcpNoDelay(httpParams, true);
            PoolingClientConnectionManager connectionManager = new PoolingClientConnectionManager();
            connectionManager.setMaxTotal(64 * 8 * 2);      // 1024 sockets in the online pool (previously 512)
            connectionManager.setDefaultMaxPerRoute(256);   // at most 256 sockets per target host (per IP)
            HttpClient httpClient = new DefaultHttpClient(connectionManager, httpParams);

       4) The dubbo configuration currently keeps the defaults; apart from setting the timeout to 60s, parameters such as thread counts have not been tuned. dubbo's asynchronous IO gives very high throughput.
       5) The GC parameters online have been uniformly adjusted to -Xms8196m -Xmx8196m -Xmn4000m for Tomcat; the machines that had previously been raised to 12G were re-adjusted to this configuration.
           With an 8G young generation, observed machine throughput was no different from a 4G one, the memory was never anywhere near fully used, and minor GC pauses reached 200ms, exceeding the roughly 100ms seen with 4G.
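          A minimal sketch of the client-side caching idea from point a) above (plain JDK, Java 8+; the class name, cached type and expiry value are illustrative, not the enterprise implementation):

            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;
            import java.util.function.Supplier;

            public class ExpiringCache<K, V> {
                private static class Entry<V> {
                    final V value;
                    final long expiresAt;
                    Entry(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
                }

                private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
                private final long ttlMillis;

                public ExpiringCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

                // Return the cached value if it is still fresh, otherwise call the loader
                // (the remote RPC/HTTP call). Duplicate loads under concurrency are tolerated here.
                public V get(K key, Supplier<V> loader) {
                    long now = System.currentTimeMillis();
                    Entry<V> e = map.get(key);
                    if (e != null && now < e.expiresAt) {
                        return e.value;                      // fresh enough: no remote call
                    }
                    V value = loader.get();                  // e.g. the announcement call to www.qbao
                    map.put(key, new Entry<>(value, now + ttlMillis));
                    return value;
                }
            }

            // Usage sketch (fetchAnnouncementsOverHttp is a hypothetical helper): cache announcements
            // for 30 seconds so the seller's homepage polling is not one www.qbao call per page request.
            //   ExpiringCache<String, String> announceCache = new ExpiringCache<>(30_000);
            //   String announce = announceCache.get(userId, () -> fetchAnnouncementsOverHttp(userId));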
      
3) Necessary pressure tests:
  1 A single enterprise instance with the online configuration: baseline (no-load) performance
  2 Pressure-test the performance of the homepage
 
4) Is rate limiting needed?
       To keep pressure from propagating between systems?
       To guarantee the core business?
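       If rate limiting is adopted, one lightweight option is to cap the number of in-flight calls to a fragile dependency and fail fast when the cap is reached, so that pressure on www.qbao cannot consume enterprise's whole thread pool. The sketch below uses only the JDK; the limit of 200 concurrent calls and the class name are arbitrary illustrative choices, not a recommendation of a specific framework:

            import java.util.concurrent.Callable;
            import java.util.concurrent.Semaphore;
            import java.util.concurrent.TimeUnit;

            public class ConcurrencyLimiter {
                // Cap on simultaneous calls to the downstream dependency (illustrative value).
                private final Semaphore permits = new Semaphore(200);

                // Run the call if a permit is available quickly; otherwise return the fallback
                // (shed load) instead of queueing more blocked threads.
                public <T> T callWithLimit(Callable<T> call, T fallback) throws Exception {
                    if (!permits.tryAcquire(50, TimeUnit.MILLISECONDS)) {
                        return fallback;
                    }
                    try {
                        return call.call();
                    } finally {
                        permits.release();
                    }
                }
            }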
   
=======================================
      
