Blog Park went down again. Should Alibaba Cloud take the blame?

Blog Park went down again. Unlike the outages at other major companies, though, Blog Park's crashes seem to happen rather often.

What is going on, and what does it have to do with Alibaba Cloud? This article tries to find out.

Whose problem is it?

Yesterday afternoon (December 8, 2023), Blog Park officially issued a fault announcement. A screenshot from the official site follows:

The fault was the database CPU hitting 100%, and it has happened 7 times this year. From what I have seen as someone who does not visit Blog Park often, it also happened in previous years, but apparently not this frequently.

It has happened 7 times and still has not been solved. What is the problem?

In my experience, a database CPU at 100% is usually caused by poorly written SQL. In some cases a statement performs a full table scan over a large amount of data, takes a long time to finish, and occupies CPU resources the whole time.

In principle, this kind of problem can be solved by locating the offending SQL and rewriting the relevant statements, yet this is exactly the problem that has stumped Blog Park.

Parameter sniffing problem?

Take a look at the official explanation on this issue:

There are two important pieces of information here: Blog Park's database is SQL Server, and its main queries use stored procedures. Blog Park is built on the .NET technology stack, so SQL Server is a natural choice. Stored procedures can improve SQL execution efficiency, and they were popular when Blog Park was founded in 2008, more than ten years ago. The paging method it uses is also relatively new, which suggests the code is being optimized continuously.

The official team suspects that parameter sniffing caused SQL Server to cache an execution plan with extremely poor performance. That sentence contains two terms, parameter sniffing and execution plan, which may confuse readers who have not met them before, so let me explain them first.

Execution plan: every SQL statement gets an execution plan when it runs inside the database. The plan mainly covers which table to query first, how the tables are joined, which indexes to use during execution, and so on.

Parameter sniffing problem: a stored procedure is compiled the first time it is executed, and the compiled result is reused by later executions instead of being rebuilt each time, because compilation is relatively expensive. During compilation, the database picks the execution plan that is optimal for the parameter values passed in that call and caches it. Later executions use the cached plan directly.

The trouble lies in the cached execution plan: its efficiency can vary greatly from one parameter value to another, mainly because the queried data is unevenly distributed.
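To make the mechanism concrete, here is a minimal toy model in Python. It is only a sketch: the table, the user names, the per-row costs, and the PlanCache class are all assumptions made up for illustration, and SQL Server's real optimizer is far more sophisticated. It only shows how a plan chosen for the first parameter value gets reused for a value it fits badly.

```python
# Toy model of parameter sniffing. Everything here is hypothetical and
# simplified; it is not how SQL Server's optimizer works internally.

# Assumed, heavily skewed table: rows per user in a posts table.
ROWS_PER_USER = {"big_user": 500_000, "small_user": 20}
TABLE_ROWS = sum(ROWS_PER_USER.values())

SEEK_COST_PER_ROW = 4  # assumed cost of an index seek plus lookup, per row
SCAN_COST_PER_ROW = 1  # assumed cost of reading one row in a full scan


def plan_cost(plan, user_id):
    """Cost of running the given plan for the given parameter value."""
    if plan == "index_seek":
        return ROWS_PER_USER[user_id] * SEEK_COST_PER_ROW
    return TABLE_ROWS * SCAN_COST_PER_ROW  # "full_scan"


def compile_plan(user_id):
    """Pick the cheaper plan for the value seen at compile time (the sniffed value)."""
    return min(("index_seek", "full_scan"), key=lambda p: plan_cost(p, user_id))


class PlanCache:
    """Compiles a plan on first execution and reuses it afterwards."""

    def __init__(self):
        self._plans = {}

    def execute(self, proc_name, user_id):
        if proc_name not in self._plans:
            self._plans[proc_name] = compile_plan(user_id)
        plan = self._plans[proc_name]
        return plan, plan_cost(plan, user_id)


cache = PlanCache()
# First call sniffs "big_user": a full scan is cheapest for that value.
print(cache.execute("get_posts", "big_user"))    # ('full_scan', 500020)
# The cached full scan is reused for "small_user", who has only 20 rows.
print(cache.execute("get_posts", "small_user"))  # ('full_scan', 500020)
# What a plan compiled for "small_user" would have cost instead.
print(plan_cost(compile_plan("small_user"), "small_user"))  # 80
```

The reverse order is just as bad in this model: if the small user's call compiles the plan first, the index seek is cached and the big user's query pays the per-row seek cost for every one of its rows.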

I often run into this in my own company's business. Some users have a lot of data and some have very little. Even with an index on the user ID field, the database sometimes decides that not using the index is more efficient. It chooses the query path it believes is better, such as a full table scan, and the statement then turns out to be slow in actual execution.

In Blog Park's case, the official team believes that parameter sniffing made one of their stored procedures produce slow SQL, the slow SQL drove CPU usage too high, and the database eventually crashed.

The official team has not yet located the problematic SQL or stored procedure. Maybe Blog Park has too many SQL statements and more than one of them is at fault. Or is the problem in SQL Server or Alibaba Cloud?

SQL Server problem?

SQL Server is a commercial database that has survived to this day at a price that is not low, so its capabilities have been tested by a harsh market. Every product inevitably has bugs, but a bug that causes this kind of problem should not persist for so long. SQL Server itself, and in particular the way it queries data, is therefore unlikely to be the main problem.

Many readers also say that SQL Server performs poorly. In my experience, SQL Server's query performance is often much better than MySQL's in similar scenarios, and many other users report the same:

I also specifically looked for some performance comparisons between SQL Server and other databases. The screenshots are as follows:

Article and data sources:

segmentfault.com/q/101000002…

www.ijarcce.com/upload/2015…

In addition, we can also get a glimpse from the monitoring log of the database shared by Blog Park:

As the chart shows, the problem occurs at fairly random times rather than at peak hours. Blog Park also mentioned that similar problems occurred at 4 to 5 o'clock in the morning. CPU usage appears to be only a little over 20%, so the database is not hitting a performance bottleneck.

Alibaba Cloud problem?

Why might Alibaba Cloud take the blame? Because Blog Park is deployed on Alibaba Cloud, and both its servers and its database are Alibaba Cloud products.

I remember that when this problem occurred before, Blog Park criticized Alibaba Cloud heavily. The two parties may later have had in-depth exchanges, after which Blog Park accepted the parameter sniffing explanation and has been investigating in that direction ever since.

So is Alibaba Cloud completely off the hook?

Under normal circumstances, the SQL Server deployed on Alibaba Cloud should be purchased from Microsoft, and Microsoft should also provide some technical support, including for installation and daily operations. This SQL Server may differ slightly from the one deployed on Azure, but Microsoft would not damage its own brand, so the database version itself should not have any major problem.

Alibaba Cloud only deploys and operates SQL Server. To put it bluntly, it provides the underlying storage, network, operating system, and similar services, while the database program on top is entirely Microsoft's and Alibaba Cloud cannot modify it. When that database program consumes 100% of the CPU, it is hard to tie the failure to anything Alibaba Cloud did.

In addition, Alibaba Cloud develops its own databases. Although SQL Server is not open source, its experts should be familiar with some of the underlying design and the likely failure modes. Many customers use the SQL Server service on Alibaba Cloud, and if many of them had hit this problem, it would have been exposed and fixed long ago.

It is therefore hard to pin this problem on Alibaba Cloud. Of course, it cannot be ruled out completely, since extreme cases always exist. Alibaba Cloud has had several outages of its own recently. Is something wrong on its side? Nobody knows.

How to solve the problem?

Change database?

As mentioned above, the database itself is unlikely to be the problem. Moreover, changing databases means rewriting all the SQL and possibly modifying the table structure, which is no small amount of work.

If this really is a parameter sniffing problem, execution plan efficiency will still be inconsistent after switching databases.

Change cloud?

This basically amounts to saying that Alibaba Cloud is not up to the job.

If you really suspect that, you can test it. Instead of migrating directly, export a copy of the data and load it onto another public cloud, or deploy a SQL Server instance locally.

Then collect the SQL execution logs and replay them against the test database. If the problem still occurs, the cloud vendor is not at fault. If the problem does not occur after a long run, there is some basis for saying that the cloud service is quite likely the problem.

Of course, this test is relatively costly. You may be able to speed it up by trimming the sample or by executing the SQL at a higher frequency.
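A replay harness for this kind of test can be quite small. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the restored SQL Server copy, and the posts table, the captured_log entries, the replay function, and the 100 ms threshold are all hypothetical. A real test would connect to the test SQL Server through a suitable driver and feed in the statements actually captured from production.

```python
import sqlite3
import time


def replay(conn, statements, slow_ms=100.0):
    """Run each logged (sql, params) pair and record how long it took."""
    results = []
    for sql, params in statements:
        start = time.perf_counter()
        conn.execute(sql, params).fetchall()
        elapsed_ms = (time.perf_counter() - start) * 1000
        results.append({
            "sql": sql,
            "params": params,
            "ms": elapsed_ms,
            "slow": elapsed_ms >= slow_ms,
        })
    return results


# Stand-in test database (assumption): in-memory SQLite with a made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO posts (user_id, title) VALUES (?, ?)",
    [(i % 10, f"post {i}") for i in range(1000)],
)

# Hypothetical captured log: statements with the parameters seen in production.
captured_log = [
    ("SELECT id, title FROM posts WHERE user_id = ? ORDER BY id LIMIT 20", (3,)),
    ("SELECT COUNT(*) FROM posts WHERE user_id = ?", (7,)),
]

report = replay(conn, captured_log)
for entry in report:
    flag = "SLOW" if entry["slow"] else "ok"
    print(f"{flag:4} {entry['ms']:8.3f} ms  {entry['sql']}  {entry['params']}")
```

Keeping the original parameter values and the original order matters here, because with parameter sniffing it is the sequence of parameters, not just the statement text, that decides which plan ends up cached.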

As a technical person, you need solid evidence before assigning blame.

Or simply be stubborn and hold Alibaba Cloud to it: either it is your problem, or you help me find the problem. Sometimes a cloud vendor's technical team will come on site or work closely with you in other ways. How about paying an expert to look into it? Maybe Blog Park is too honest? Or Alibaba Cloud too arrogant? Or maybe Blog Park is just too poor?

Solving slow SQL problems

Alibaba Cloud's role can only be guessed at, but parameter sniffing and slow SQL problems can definitely be solved. Alibaba Cloud's database products provide slow SQL log queries.

You only need to find the slow SQL that was running when the problem occurred. Judging from Blog Park's previous fault announcements, they have indeed caught some problematic SQL.

But why do problems keep appearing?

One possibility is that there is simply too much SQL. After more than ten years of iteration, Blog Park's codebase may be very large. In addition, Blog Park has had a hard time operating over the past two years, so it has no manpower or energy to invest here and can only go back and fix problems as they arise. Staying alive is already an achievement. My guess is that the team has no specialists to spare and all its energy goes into survival.

I will leave the specific reasons why it has not been resolved at that.

Let me now talk about how to solve the parameter sniffing problem, which I think is what matters most to technical readers.

As described above, the parameter sniffing problem is that the database uses inefficient execution plans, so the core idea is to keep the database from using them. Here are the methods I know.

Brute-force cleanup

Restarting the server or restarting the database is more or less what Blog Park has been doing.

A slightly more elegant option is to clear the entire execution plan cache with DBCC FREEPROCCACHE, whether or not the cached plans are problematic. I am not sure, however, whether this command can be run on Alibaba Cloud's database service.

All of these force the execution plans to be rebuilt. The drawback is that the impact is relatively large and likely to disrupt users of the service, so this is a heavy-handed approach.

Moreover, these methods do not cure the problem and only bring short-term relief. At some point a poor execution plan may be compiled and cached again, and SQL execution will time out again.

Elegant mechanism

SQL Server itself also offers some more elegant ways to alleviate this problem. For example:

  • Do not cache the execution plan. Caching brings some efficiency gains, but they are negligible compared with the performance loss caused by the parameter sniffing problem. You can use WITH RECOMPILE in a stored procedure to have the query recompiled every time.
  • Force a particular query plan, for example by forcing a certain index that is not too bad for any of the queries; SQL Server can also force a query plan compiled for specified conditions. Finding such an index or condition can be hard, though, because the data keeps changing: what works well for today's data may not always work well.
  • Clear only the cached plan of a specific statement or stored procedure by passing its plan handle to DBCC FREEPROCCACHE, as in DBCC FREEPROCCACHE(plan_handle), which has much less impact.
  • In addition, stale table statistics, index fragmentation, and missing indexes may cause or aggravate parameter sniffing problems. When you run into trouble, these are also worth investigating.

For details, please refer to this article from Alibaba:  mysql.taobao.org/monthly/201…
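The trade-offs between these options can be sketched with another small Python toy model. This is an assumption-laden illustration, not SQL Server's behavior in detail: a "plan" here is just the parameter value it was compiled for, and ToyPlanCache, its methods, and the procedure names are all hypothetical. The recompile flag stands in for WITH RECOMPILE, evict for clearing a single plan, and clear for flushing the whole cache.

```python
class ToyPlanCache:
    """Toy plan cache: a plan is represented by the value it was compiled for."""

    def __init__(self):
        self.plans = {}    # procedure name -> value the cached plan was compiled for
        self.compiles = 0  # number of compilations paid for so far

    def execute(self, proc, value, recompile=False):
        """Return True if the plan used was compiled for this exact value."""
        if recompile:
            # Stand-in for WITH RECOMPILE: build a fresh plan and do not cache it.
            self.compiles += 1
            return True
        if proc not in self.plans:
            self.compiles += 1
            self.plans[proc] = value
        return self.plans[proc] == value

    def evict(self, proc):
        """Stand-in for clearing one plan: other procedures keep their plans."""
        self.plans.pop(proc, None)

    def clear(self):
        """Stand-in for clearing the whole cache: everything must recompile."""
        self.plans.clear()


cache = ToyPlanCache()
first = cache.execute("get_posts", "big_user")     # compiled for big_user
second = cache.execute("get_posts", "small_user")  # reuses the big_user plan
cache.execute("get_tags", "python")                # an unrelated, healthy plan
cache.evict("get_posts")                           # targeted cleanup
third = cache.execute("get_posts", "small_user")   # recompiled for small_user
print(first, second, third)                        # True False True
print("get_tags" in cache.plans, cache.compiles)   # True 3

# Recompiling on every call always fits the value, but pays a compile each time.
always = ToyPlanCache()
fits = [always.execute("get_posts", v, recompile=True)
        for v in ("big_user", "small_user", "big_user")]
print(fits, always.compiles, always.plans)         # [True, True, True] 3 {}
```

In this model the targeted eviction fixes the mismatched procedure while leaving the healthy plan for get_tags in place, whereas recompiling on every call never mismatches but pays the compilation cost each time.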

Evaluate carefully

When we design tables and write SQL, we need to consider how the data will be distributed and what conditions the query will have, especially when the data may be unevenly distributed.

For example, some users may have 10 or even 100 times as much data as most users. A sort field may prevent an index that contains the filter fields from being used, and the query may drift between multiple indexes.
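One simple way to spot this kind of skew early is to compare each user's row count with the typical row count. The Python sketch below assumes you can export the user ID column (or a sample of it) as a list; the skew_report function, the sample data, and the 10x threshold are made up for illustration.

```python
from collections import Counter
from statistics import median


def skew_report(user_ids, ratio_threshold=10.0):
    """Return the median rows per user and the users at or above threshold times that."""
    counts = Counter(user_ids)
    typical = median(counts.values())
    heavy = {user: n for user, n in counts.items() if n >= typical * ratio_threshold}
    return typical, heavy


# Hypothetical sample: three ordinary users and one very heavy user.
sample = ["alice"] * 3 + ["bob"] * 4 + ["carol"] * 5 + ["dave"] * 400

typical, heavy = skew_report(sample)
print(typical)  # 4.5
print(heavy)    # {'dave': 400}
```

Users flagged this way are the parameter values most likely to need a different plan from everyone else, so they are good candidates for testing queries before going live.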

If problems are likely, we must think through how to design the table and how to query the data. When an ordinary relational database struggles to cope, we can also consider NoSQL, distributed databases, and other solutions to keep query efficiency stable.

Origin blog.csdn.net/m0_60961651/article/details/135263813