How to Write Better SQL Queries: The Ultimate Guide - Part Three



 

This is the final article in the "How to Write Better SQL Queries" series.

 

Time Complexity and Big O Notation

The first two articles gave us some understanding of query plans. We can now use computational complexity theory to dig deeper and think about performance improvements. This field of theoretical computer science classifies computational problems according to their difficulty. These computational problems can be algorithms, but they can also be queries.

For queries, we can classify them not by difficulty but by how long they take to run and return a result. This approach is known as classification by time complexity.

Big O notation expresses running time in terms of how quickly it grows as the input grows, since the input can be arbitrarily large. It drops coefficients and lower-order terms so that you can focus on the important part of query runtime: the growth rate. Expressed this way, time complexity is described asymptotically, that is, as the input size tends to infinity.
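As a rough illustration of why coefficients and lower-order terms are dropped, the sketch below uses a made-up cost function T(n) = 3n^2 + 50n + 1000. The constants are arbitrary assumptions chosen for the example, not measurements of any real query. It prints how much of the total cost comes from the n^2 term as n grows.

```python
def cost(n):
    """Hypothetical running-time function T(n) = 3n^2 + 50n + 1000 (made-up constants)."""
    return 3 * n**2 + 50 * n + 1000


def leading_share(n):
    """Fraction of the total cost contributed by the 3n^2 term alone."""
    return (3 * n**2) / cost(n)


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(f"n={n:>7}  T(n)={cost(n):>14}  share of n^2 term={leading_share(n):.4f}")
```

For small n the constant term dominates, but for large n almost all of the cost comes from the n^2 term, and doubling n roughly quadruples T(n). That is the behaviour O(n^2) summarizes.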

In database terms, complexity measures how much longer a query takes to run as the size of the database grows.

Note that the size of the database grows not only with the data stored in the tables; the indexes in the database also contribute to its size.

 

Estimating the time complexity of a query plan

The execution plan defines the algorithm used for each operation, so the execution time of any query can be expressed logically as a function of the sizes of the tables involved in the plan. In other words, you can estimate a query's complexity and performance using Big O notation and its execution plan.

The summary below introduces four classes of time complexity.

From the examples, you will see that a query's time complexity depends on what the query actually does.

Different databases use different indexing methods, execution plans, and implementations, all of which need to be taken into account.

The time complexity classes listed below are therefore very general.

O(1): constant time

A query whose execution time is the same regardless of the input size runs in constant time. Such queries are not common; here is an example:

SELECT TOP 1 t.*
FROM t

The time complexity of this query is constant because it simply selects an arbitrary row from the table. The running time is therefore independent of the table size.

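O(n): linear time

If an algorithm's execution time is directly proportional to the input size, it runs in linear time: the execution time increases as the input grows. For databases, this means query execution time is directly proportional to table size: as the number of rows in the table grows, the query time grows accordingly.

An example is a query with a WHERE clause on an unindexed column: it requires a full table scan (or sequential scan), which results in O(n) time complexity. Every row in the table must be read to find the one with the right ID. Even if the very first row is a match, the query still reads every row.

If there is no index on i_id, the following query also has O(n) complexity: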

SELECT i_id
FROM item;
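  • This also means that counting queries such as COUNT(*) FROM TABLE have O(n) time complexity, because a full table scan is performed unless the table's total row count is stored. In that case, the complexity is more like O(1).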
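The difference between a full scan and a stored row count can be sketched in a few lines of Python. This is only a toy model: the `find_by_id` function and the `CountedTable` class are hypothetical names invented for this example, and a real database engine stores and scans rows very differently.

```python
def find_by_id(rows, wanted_id):
    """Full scan over unindexed rows: every row is read, even after a match.

    Returns (matching_rows, rows_read).
    """
    matches = []
    rows_read = 0
    for row in rows:
        rows_read += 1
        if row["i_id"] == wanted_id:
            matches.append(row)
    return matches, rows_read


class CountedTable:
    """Toy table that keeps its row count up to date on every insert."""

    def __init__(self):
        self.rows = []
        self.row_count = 0

    def insert(self, row):
        self.rows.append(row)
        self.row_count += 1

    def count_by_scan(self):
        # O(n): touches every row, like COUNT(*) with no stored total.
        return sum(1 for _ in self.rows)

    def count_stored(self):
        # O(1): returns the maintained counter without reading any row.
        return self.row_count
```

Looking up the very first id with `find_by_id` still reports as many rows read as the table holds, which matches the full scan behaviour described above.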

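Closely related to linear execution time are execution plans whose total time is a sum of linear terms. Here are some examples:

  • A hash join has complexity O(M + N). The classic hash join algorithm for an inner join of two tables first builds a hash table for the smaller table. Each hash table entry consists of the join attribute and its row. The hash table is accessed by applying a hash function to the join attribute. Once the hash table is built, the larger table is scanned, and the matching rows from the smaller table are found by looking them up in the hash table.
  • A merge join has complexity O(M + N), but it depends heavily on indexes on the join columns; where there is no index, the rows must first be sorted by the key used in the join:
    • If both tables are already sorted by the key used in the join, the query has complexity O(M + N).
    • If both tables have an index on the join columns, the indexes already keep those columns in order and no sort is needed. The complexity is O(M + N).
    • If neither table has an index on the join columns, both tables must be sorted first, so the complexity is O(M log M + N log N).
    • If one table has an index on the join columns and the other does not, the table without the index must be sorted first, so the complexity is O(M + N log N).
  • For nested joins, the complexity is generally O(MN). This join is especially efficient when one or both tables are very small (for example, fewer than 10 records).

Remember: a nested join compares every record in one table with every record in the other.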
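The hash join and merge join described above can be sketched in Python as follows. This is a minimal illustration, not how any particular database implements them: the function names, the use of lists of dicts as tables, and the shared key column name are all assumptions made for the example. The merge join assumes both inputs are already sorted by the key, and the O(M + N) bound assumes the output is not dominated by many duplicate keys.

```python
def hash_join(small, large, key):
    """Build a hash table on the smaller input, then probe it with the larger one."""
    buckets = {}
    for row in small:                       # build phase: one pass over M rows
        buckets.setdefault(row[key], []).append(row)
    joined = []
    for row in large:                       # probe phase: one pass over N rows
        for match in buckets.get(row[key], []):
            joined.append({**match, **row})
    return joined


def merge_join(left, right, key):
    """Join two inputs that are ALREADY sorted by key, advancing two cursors."""
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key ...
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # ... and pair it with every left row that has the same key.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    joined.append({**left[i], **r})
                i += 1
            j = j_end
    return joined
```

With a small `authors` list and a larger `items` list that share an `a_id` column, both functions return the same joined rows, and each input is read only once.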

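O(log(n)): logarithmic time

An algorithm runs in logarithmic time if its execution time is proportional to the logarithm of the input size; for queries, this means execution time is proportional to the logarithm of the database size.

Logarithmic time complexity applies to query plans that find rows through an index or a clustered index. A clustered index is an index whose leaf level contains the actual data rows of the table. A clustered index is much like any other index: it is defined on one or more columns, which form the index key. The clustering key is the set of key columns of the clustered index. A clustered index scan is the basic operation in which the RDBMS reads the clustered index row by row from start to finish. Strictly speaking, it is the keyed lookup through the index that takes logarithmic time; reading the whole index from end to end still touches every row.

In the following example, there is an index on i_id, which gives a complexity of O(log(n)):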

SELECT i_stock
FROM item
WHERE i_id = N;

Without the index, the time complexity would be O(n).
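To get a feel for why an indexed lookup is logarithmic, the sketch below uses a sorted Python list as a stand-in for an index and counts the probes made by a binary search. This is an assumption made for illustration: real databases typically use B-tree structures rather than a flat sorted list, and `index_lookup` is a hypothetical helper, but the growth rate of the lookup is the same in spirit.

```python
def index_lookup(sorted_ids, wanted_id):
    """Binary search over a sorted list of ids.

    Returns (position or None, number of probes made).
    """
    lo, hi = 0, len(sorted_ids) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_ids[mid] == wanted_id:
            return mid, probes
        if sorted_ids[mid] < wanted_id:
            lo = mid + 1      # discard the lower half
        else:
            hi = mid - 1      # discard the upper half
    return None, probes
```

Each probe halves the remaining search range, so a list of a million ids needs at most about 20 probes, where a scan without an index could read all million rows.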

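O(n^2): quadratic time

An algorithm runs in quadratic time if its execution time is proportional to the square of the input size. For databases, this means query execution time is proportional to the square of the database size.

An example of a query with quadratic time complexity is the following: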

SELECT *
FROM item, author
WHERE item.i_a_id=author.a_id

The minimum complexity would be O(n log(n)), but depending on the index information for the join attributes, the maximum complexity can be O(n^2).
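The worst case corresponds to comparing every row of one table with every row of the other. The following Python sketch counts those comparisons for toy versions of the item and author tables. The helper names and the generated data are assumptions made for the example; only the column names i_a_id and a_id come from the query above.

```python
def nested_loop_join(items, authors):
    """Compare every item with every author.

    Returns (joined_rows, number_of_comparisons).
    """
    joined = []
    comparisons = 0
    for item in items:
        for author in authors:
            comparisons += 1
            if item["i_a_id"] == author["a_id"]:
                joined.append({**item, **author})
    return joined, comparisons


def make_tables(n):
    """Build two toy tables of n rows each, one item per author."""
    authors = [{"a_id": k} for k in range(n)]
    items = [{"i_id": k, "i_a_id": k} for k in range(n)]
    return items, authors


if __name__ == "__main__":
    for n in (10, 100, 1000):
        _, comparisons = nested_loop_join(*make_tables(n))
        print(f"n={n:>5}  comparisons={comparisons}")
```

With n rows in each table the loop performs n * n comparisons, so multiplying the table size by 10 multiplies the work by 100.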

The chart below estimates query performance according to time complexity; it lets you see how each class of algorithm performs.

 

 

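SQL Tuning

With query plans and time complexity in mind, you can tune SQL queries further in the following ways:

  • Replace unnecessary full table scans of large tables with index scans;
  • Make sure tables are joined in the optimal order;
  • Make sure indexes are used optimally;
  • Cache full table scans of small tables.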
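One way to check whether a full table scan has been replaced by an index lookup is to ask the database for its query plan before and after creating an index. The sketch below does this with SQLite through Python's built-in sqlite3 module. SQLite is an assumption chosen because it needs no server; other systems expose plans through their own commands and report them in different wording. The table layout, the index name idx_item_i_id, and the sample data are made up for the example.

```python
import sqlite3


def plan_for(conn, sql):
    """Return the detail text of SQLite's EXPLAIN QUERY PLAN for sql."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable description is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


def demo():
    """Show the plan for the same query without and with an index on i_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE item (i_id INTEGER, i_stock INTEGER)")
    conn.executemany(
        "INSERT INTO item VALUES (?, ?)",
        [(k, k % 7) for k in range(1000)],
    )
    query = "SELECT i_stock FROM item WHERE i_id = 42"
    before = plan_for(conn, query)          # no index yet: table scan
    conn.execute("CREATE INDEX idx_item_i_id ON item (i_id)")
    after = plan_for(conn, query)           # index available: index search
    conn.close()
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print("without index:", before)
    print("with index:   ", after)
```

The exact wording of the plan differs between SQLite versions, but the first plan should describe a scan of item and the second a search that uses idx_item_i_id.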

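That concludes the "How to Write Better SQL Queries" tutorial. We hope it helps you write better, more efficient SQL queries.

Original link: https://www.datacamp.com/community/tutorials/sql-tutorial-query#importance

When reposting, please credit the source: 葡萄城控件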
