Troubled times: recent problems on Alibaba Cloud (load balancer failure, 100% server CPU, DDoS attacks)

Yesterday, from 22:00 to 22:30 and from 23:30 to 00:30, one Alibaba Cloud load balancer that had been in service for many years suddenly failed. Users who reached the blog site through this load balancer encountered 502, 503, and 504 errors. We apologize for the trouble this caused.

The problem was very strange. Judging by the symptoms, something seemed to be wrong with the internal network communication between the load balancer and the back-end servers. Sometimes the health check succeeded, but requests forwarded to the back-end servers failed. Sometimes the back-end servers were clearly healthy, yet the health check failed. At worst, the health checks for all back-end servers failed. Other load balancers that use the same back-end servers did not have this problem, and in the end we resolved it by taking this load balancer offline.

We purchased this load balancer in 2013, when we had just moved to Alibaba Cloud. It has served for many years without ever having a problem like this, and it now looks as though it will be forced to retire.

Yesterday morning we found that every server in the docker swarm cluster, which hosts all of our applications other than the blog site, was at 100% CPU.

This 100% CPU is very different from the usual kind: although usage sits at 100%, the applications keep running normally. We ran into the same problem in March this year. At that time the top command showed that sy (system CPU time spent in kernel space) was taking up a lot of the CPU, and we later resolved it by restarting all the worker node servers in the cluster and redeploying the applications.
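To make the sy figure concrete, here is a minimal sketch of how the same breakdown that top reports can be computed on Linux from two samples of the aggregate `cpu` line in `/proc/stat`. It is only an illustration, not the tooling we used: the helper names (`parse_cpu_line`, `cpu_breakdown`, `read_cpu_line`) and the one-second sampling interval are assumptions made for this example.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Pad with zeros in case the kernel reports fewer counters.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_breakdown(before, after):
    """Return the percentage of time spent in each state between two samples."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    deltas = {name: second[name] - first[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which aggregates all CPUs."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__":
    sample_a = read_cpu_line()
    time.sleep(1)  # hypothetical sampling interval
    sample_b = read_cpu_line()
    usage = cpu_breakdown(sample_a, sample_b)
    print("us=%.1f%% sy=%.1f%% id=%.1f%%"
          % (usage["user"], usage["system"], usage["idle"]))
```

A high `system` share alongside a low `user` share would match what we saw: the CPU is busy in kernel space rather than in application code, which may be why the applications themselves kept responding.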

This morning we took the approach of restarting the node servers, and after the restart the servers' CPU returned to normal. During the operation, however, the container for the flash memory application ran into a problem, and the flash memory site could not be accessed normally for about 15 minutes. We apologize for the trouble this caused.

Recently the blog site has suffered repeated DDoS attacks, with the highest attack traffic reaching nearly 80G. Alibaba Cloud blocks the attacked target for 30 minutes. We have emergency measures in place, but they take about 10 minutes to take full effect, so with each attack some users may be affected for about 10 minutes before normal access resumes. We apologize for the trouble this causes.

In these troubled times the site has had many failures, which caused everyone a lot of trouble. We sincerely ask for your understanding.

These troubled times are also a test for us. We will learn from them, improve ourselves further, work harder in the days to come, and build a more vibrant garden.

Origin www.cnblogs.com/cmt/p/11582653.html