[Incident Notice] Outage caused by database server CPU at nearly 100% (due to a bug in .NET Core 3.0)

We apologize for the trouble this outage caused you and ask for your understanding.

Around 10:54 this morning, the CPU usage of our database service (an Alibaba Cloud RDS instance running SQL Server 2016 Standard Edition) suddenly soared to more than 90%, and a large number of database query timeout errors appeared in the application logs.

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258

After receiving the alert and confirming the problem, we started a primary/standby failover of the Alibaba Cloud RDS instance at 11:06. The failover completed at 11:08 and the database CPU returned to normal. But docker swarm, which always lets us down at the critical moment, did so again: after the database recovered, one node in the docker swarm cluster hosting the blog site became abnormal, and some requests returned 50x errors. After we removed the abnormal node from the cluster and started a new node, service returned to normal at 11:15.

Using CloudDBA in the Alibaba Cloud RDS console, we found one SQL statement with an unusually large number of executions during the period when the CPU was at nearly 100%.

SELECT TOP @__p_1 [b].[TagName] AS [Name], [b].[TagID] AS [Id], [b].[UseCount], [b].[BlogId]
FROM [blog_Tag] [b]
WHERE [b].[BlogId] = @__blogId_0
    AND @__blogId_0 IS NOT NULL
    AND [b].[UseCount] > ?
ORDER BY [b].[UseCount] DESC

The SQL statement above was generated by EF Core 3.0. The IS NOT NULL condition in it comes from a notorious EF Core 3.0 bug: the generated SQL contains a redundant IS NOT NULL check on the query parameter.
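For comparison, here is a minimal sketch of the same query without the redundant predicate. Under the default ANSI_NULLS ON setting, [b].[BlogId] = @blogId can never be true when @blogId is NULL, so dropping the IS NOT NULL check does not change the result. The variable names, types, and values are illustrative assumptions, not the application's actual parameters.

```sql
-- Hypothetical parameters; the INT types and the values are assumptions for illustration.
DECLARE @blogId INT = 1;
DECLARE @top INT = 10;
DECLARE @minUseCount INT = 0;

SELECT TOP (@top) [b].[TagName] AS [Name], [b].[TagID] AS [Id], [b].[UseCount], [b].[BlogId]
FROM [blog_Tag] [b]
WHERE [b].[BlogId] = @blogId            -- evaluates to UNKNOWN (row filtered out) when @blogId is NULL
    AND [b].[UseCount] > @minUseCount
ORDER BY [b].[UseCount] DESC;
```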

Who would have thought (even Microsoft itself did not expect it) that this seemingly harmless redundancy carries a fatal risk: in some cases it keeps the entire database server's CPU at 100% (or nearly 100%). When we first ran into this problem we did not expect it either, and so we wrongly blamed Alibaba Cloud (blog post link). Later, after Alibaba Cloud database experts analyzed the problem, we found that the real culprit was the redundant IS NOT NULL generated by EF Core. In some cases it causes SQL Server to cache an execution plan with extremely poor performance (very CPU-intensive), and every subsequent query then uses that plan, so the CPU stays high.

This bad execution plan does double damage. It consumes a huge amount of database CPU, and at the same time the affected queries cannot complete, so their results are never cached in memcached. As a result, more and more requests run the query with the bad plan, and an avalanche effect occurs. The only solution is to remove the bad execution plan from the cache; a primary/standby failover or a server reboot is simply a crude way of clearing the execution plan cache.
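A more targeted alternative to a failover is to locate the offending plan and evict only that one. The following is a minimal sketch, not the procedure used in this incident. It relies on the standard SQL Server views sys.dm_exec_query_stats and sys.dm_exec_sql_text and on DBCC FREEPROCCACHE, which require VIEW SERVER STATE and ALTER SERVER STATE permissions respectively. Whether a managed RDS instance grants those permissions is an assumption, and the LIKE pattern is only an example.

```sql
-- List the most CPU-expensive cached plans whose text contains the redundant predicate.
-- The LIKE pattern is an assumption; adjust it to the statement being investigated.
-- Note: this diagnostic query matches its own text, so it may show up in the results.
SELECT TOP (20)
    qs.plan_handle,
    qs.execution_count,
    qs.total_worker_time / 1000 AS total_cpu_ms,   -- total_worker_time is in microseconds
    st.text AS sql_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE N'%blog_Tag%IS NOT NULL%'
ORDER BY qs.total_worker_time DESC;

-- Evict a single plan by passing a plan_handle value returned above
-- (placeholder shown; substitute the real varbinary value before running):
-- DBCC FREEPROCCACHE (<plan_handle>);
```

Evicting a single plan leaves the rest of the cache warm. SQL Server may still compile the same bad plan again on the next execution, so this only treats the symptom until the generated SQL itself is fixed.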

Before we first ran into this problem, someone had already reported it on GitHub:

Yeah this needs to be fixed asap. We just deployed code that uses 3.0 and had to immediately revert to 2.2 because simple queries blew up our SQL Azure CPU usage. Went from under 50% to 100% and stayed there until we rolled back.

But the report did not get enough attention from Microsoft. After we learned that we had wronged Alibaba Cloud and that the problem was actually Microsoft's, we reported it to the Microsoft .NET team. This time it got Microsoft's attention and was fixed quickly, but the fix was released in a .NET Core 3.1 Preview version. We tested it in a non-production environment and confirmed that the IS NOT NULL problem was indeed fixed. Because it was a Preview version, and the final .NET Core 3.1 release was due before the end of the year, we did not deploy the fix to production. We only replaced the complex SQL statement involved in the last occurrence with a stored procedure called through Dapper. Later, Alibaba Cloud database experts analyzed our database further and found that even the database's usual CPU spikes (occasional sharp jumps) were related to IS NOT NULL.

That is the background to this outage: while waiting for the official .NET Core 3.1 release to fix the bug, we fell into the same pit again. The difference from last time is that the SQL statement at fault is very simple and contains only one IS NOT NULL, which shows how destructive this pitfall is.

This pitfall deserves a place in the annals of .NET Core. Another .NET Core pitfall we will not forget, and one for which we also wrongly blamed Alibaba Cloud, is that SqlClient in the official .NET Core release was missing a Dispose call. See "The Road to Cloud Computing - On Alibaba Cloud: The Truth Behind Over 10,000 Database Connections, from Alibaba Cloud RDS to Microsoft .NET Core".

Origin www.cnblogs.com/cmt/p/11916927.html