SQL Prompt tutorial: Use [NOT] EXISTS instead of [NOT] IN for subqueries (PE019)

SQL Prompt is a practical SQL code-completion tool. It searches the database's object names, syntax and code snippets as you type, and offers appropriate code suggestions. Automatic script formatting keeps code simple and easy to read, which is especially useful when developers are unfamiliar with a script. SQL Prompt is ready to use as soon as it is installed and can greatly improve coding efficiency. Users can also customize it so that it works the way they expect.


When comparing data sets using subqueries, it used to be that the EXISTS logical operator was faster than IN. For example, where a query had to perform a certain task, but only if the subquery returned any rows, then when evaluating WHERE [NOT] EXISTS (subquery) the database engine could stop searching as soon as it found a single row, whereas WHERE [NOT] IN (subquery) would always collect all the results from the subquery before further processing.
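As a minimal sketch of that "only when the subquery returns any rows" pattern, the following uses a hypothetical table variable, @Orders, which is not part of the article's download. The IF EXISTS test only needs to establish that at least one qualifying row exists.

```sql
-- Hypothetical table variable, for illustration only
DECLARE @Orders TABLE (OrderID INT NOT NULL, Status VARCHAR(10) NOT NULL);
INSERT INTO @Orders (OrderID, Status)
VALUES (1, 'shipped'), (2, 'pending'), (3, 'pending');

-- Perform the task only if at least one pending order exists
IF EXISTS (SELECT * FROM @Orders WHERE Status = 'pending')
  PRINT 'There is work to do';
ELSE
  PRINT 'Nothing pending';
```

With the sample rows above, the first branch runs, because two pending orders exist.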

However, the query optimizer now treats EXISTS and IN the same way whenever it can, so you are unlikely to see any significant performance difference. You do need to be careful with the NOT IN operator, though, if the source data of the subquery contains NULL values. If it does, you should consider using NOT EXISTS instead of NOT IN, or recast the statement as a left outer join.

A code analysis rule in SQL Prompt (PE019) suggests using [NOT] EXISTS instead of [NOT] IN.

Which is better: EXISTS or IN...?

There are several ways to work out the difference between two data sets, but the two most common are to use the EXISTS or the IN logical operator. Imagine we have two simple tables, one containing all the common words in English (CommonWords), and the other containing a list of all the words in Bram Stoker's "Dracula" (WordsInDracula). The TestExistsAndIn download includes the scripts to create these two tables and to populate each of them from its associated text file. Tables like these are generally useful to have on a sandbox server for running tests while doing development work, although you can choose which book to use!

How many uncommon words are there in Dracula? Assuming that there are no NULL values in the CommonWords.Word column (more on that later), the following queries return the same result (1555 words) and have the same execution plan, which uses a merge join (Right Anti Semi Join) between the two tables.
--Using NOT IN
SELECT Count(*)
FROM dbo.WordsInDracula
WHERE word NOT IN (SELECT CommonWords.word FROM dbo.CommonWords);

--Using NOT EXISTS
SELECT Count(*)
FROM dbo.WordsInDracula
WHERE NOT EXISTS
(SELECT * FROM dbo.CommonWords
WHERE CommonWords.word = WordsInDracula.word);

Listing 1

In short, the SQL Server optimizer treats both queries in the same way, and they perform the same too.

…or ANY, EXCEPT, INNER JOIN, OUTER JOIN or INTERSECT?
What about all the other possible techniques, though, such as using ANY, EXCEPT, INNER JOIN, OUTER JOIN or INTERSECT? Listing 2 shows seven further alternatives that I could easily think of, although there are others.
--Using ANY
SELECT Count(*)
FROM dbo.WordsInDracula
WHERE NOT(WordsInDracula.word = ANY
(SELECT word
FROM commonwords ));
--Right anti semi merge join

--using EXCEPT
SELECT Count(*)
FROM
(
SELECT word
FROM dbo.WordsInDracula
EXCEPT
SELECT word
FROM dbo.CommonWords
) AS JustTheUncommonOnes;
--Right anti semi merge join

--using LEFT OUTER JOIN
SELECT Count(*)
FROM dbo.WordsInDracula
LEFT OUTER JOIN dbo.CommonWords
ON CommonWords.word = WordsinDracula.word
WHERE CommonWords.word IS NULL;
--right outer merge join

--using FULL OUTER JOIN
SELECT Count(*)
FROM dbo.WordsInDracula
FULL OUTER JOIN dbo.CommonWords
ON CommonWords.word = WordsinDracula.word
WHERE CommonWords.word IS NULL;
--Full outer join implemented as a merge join.

--using INTERSECT to get the difference
SELECT (SELECT Count(*) FROM WordsInDracula)-Count(*)
FROM
(
SELECT word
FROM dbo.WordsInDracula
INTERSECT
SELECT word
FROM dbo.CommonWords
) AS JustTheUncommonOnes;
--inner merge join

--using FULL OUTER JOIN syntax to get the difference
SELECT Count(*)-(SELECT Count(*) FROM CommonWords)
FROM dbo.WordsInDracula
FULL OUTER JOIN dbo.CommonWords
ON CommonWords.word = WordsinDracula.word;
--full outer merge join

--using INNER JOIN syntax to get the difference
SELECT (SELECT Count(*) FROM WordsinDracula)-Count(*)
FROM dbo.WordsInDracula
INNER JOIN dbo.CommonWords
ON CommonWords.word = WordsinDracula.word;
--inner merge join

Listing 2

Test harness

All nine queries give the same results, but does any one approach perform better? Let's put them all into a simple test harness and see how long each version takes! Again, the code download file includes the test harness code and all nine queries.
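The harness itself is in the download and is not reproduced here. As a rough sketch of the general idea only, and not the author's actual harness, the timings could be captured by logging a timestamp after each query. The @log table variable and its column names are hypothetical; the two queries are the ones from Listing 1.

```sql
-- Hypothetical timing log, for illustration; not the harness from the download
DECLARE @log TABLE
  (
  TheOrder INT IDENTITY(1, 1),
  WhatHappened VARCHAR(200) NOT NULL,
  WhenItDid DATETIME2 NOT NULL DEFAULT SYSDATETIME()
  );
DECLARE @count INT; -- assigning to a variable avoids timing the return of results

INSERT INTO @log (WhatHappened) VALUES ('Start');

SELECT @count = Count(*)
  FROM dbo.WordsInDracula
  WHERE word NOT IN (SELECT CommonWords.word FROM dbo.CommonWords);
INSERT INTO @log (WhatHappened) VALUES ('NOT IN');

SELECT @count = Count(*)
  FROM dbo.WordsInDracula
  WHERE NOT EXISTS
    (SELECT * FROM dbo.CommonWords
       WHERE CommonWords.word = WordsInDracula.word);
INSERT INTO @log (WhatHappened) VALUES ('NOT EXISTS');

-- Elapsed milliseconds for each step, measured from the previous log entry
SELECT ending.WhatHappened,
       DateDiff(ms, starting.WhenItDid, ending.WhenItDid) AS ElapsedMs
  FROM @log AS starting
    INNER JOIN @log AS ending
      ON ending.TheOrder = starting.TheOrder + 1
  ORDER BY ending.TheOrder;
```

The remaining queries would be added in the same way, each followed by its own insert into the log.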

It turns out that although the queries look very different, the differences are usually just "syntactic sugar" to the optimizer. No matter how elegant your SQL is, the optimizer just shrugs and comes up with an efficient plan to execute it. In fact, the first four all use exactly the same "right anti semi merge join" execution plan and all take the same amount of time.

We'll check for variation by running the test several times. The INTERSECT and INNER JOIN queries both use an inner merge join and are close. The two FULL OUTER JOIN queries are slightly slower, but it is a close race.

The trap of NOT IN

There is a certain unreality in comparing sets that contain NULL values, but if it happens in the heat of everyday database reporting, things can go badly wrong. If a NULL value in the result of the subquery or expression is passed to the IN logical operator, it gives a reasonable response, the same as the equivalent EXISTS. However, NOT IN behaves quite differently.

Listing 3 demonstrates this problem. We insert three common words and three less common words in the @someWord table variable, and we want to know the number of common words that are not in the table variable.
SET NOCOUNT ON;
DECLARE @someWord TABLE
(
word NVARCHAR(35) NULL
);
INSERT INTO @someWord
(
word
)
--three common words
SELECT TOP 3
word
FROM dbo.commonwords
ORDER BY word DESC;

--three uncommon words
INSERT INTO @someWord
(
word
)
VALUES
('flibberty'),
('jibberty'),
('flob');

SELECT [NOT EXISTS without NULL] = COUNT(*)
FROM commonwords AS MyWords
WHERE NOT EXISTS
(
SELECT word FROM @someWord AS s WHERE s.word LIKE MyWords.word
);

SELECT [NOT IN without NULL] = COUNT(*)
FROM commonwords AS MyWords
WHERE word NOT IN (
SELECT word FROM @someWord
);

--Insert a NULL value
INSERT INTO @someWord
(
word
)
VALUES
(NULL);

SELECT [NOT EXISTS with NULL] = COUNT(*)
FROM commonwords AS MyWords
WHERE NOT EXISTS
(
SELECT word FROM @someWord AS s WHERE s.word LIKE MyWords.word
);
SELECT [NOT IN with NULL] = COUNT(*)
FROM commonwords AS MyWords
WHERE word NOT IN (
SELECT word FROM @someWord
);

Listing 3

The NOT IN query run before we inserted a NULL into @someWord, and both NOT EXISTS queries, all correctly tell us that 60385 words are not in our table variable, because three are, and there are 60388 common words in all. However, if the subquery can return a NULL, NOT IN returns no rows at all.


NULL really means "unknown" rather than nothing, which is why any expression that compares against a NULL value returns NULL, or unknown.
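A small illustration of that three-valued logic, assuming the default ANSI_NULLS ON setting: neither the comparison with NULL nor its negation evaluates to true, so the CASE expression falls through to its ELSE branch.

```sql
-- Assumes SET ANSI_NULLS ON (the default)
-- Neither the comparison nor its negation is true when NULL is involved
SELECT CASE
         WHEN 'zygote' = NULL THEN 'true'
         WHEN NOT ('zygote' = NULL) THEN 'false'
         ELSE 'unknown'
       END AS [Comparison with NULL];
```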
Logically speaking, SQL Server evaluates the subquery, replaces it with the list of values it returns, and then evaluates the [NOT] IN condition. For the IN variant of our query, this causes no problem because it resolves to the following:
WHERE word ='flibberty' OR word ='jibberty' OR word ='flob'
OR word ='zygotes' OR word ='zygote' OR word ='zydeco'
OR word = NULL;
This returns 3 rows, for the matches on the "z..." words. The sting comes with NOT IN, which resolves to the following:
WHERE word <>'flibberty' AND word <>'jibberty'
AND word <>'flob' AND word <>'zygotes' AND word <>'zygote' AND word <>'zydeco'
AND word <> NULL;
The final AND condition, a comparison with NULL, evaluates to "unknown", so the expression always returns zero rows. This is not a bug; it is by design. You can argue that NULLs should not be allowed in any column on which you want to use a NOT IN expression, but in our real working lives these things can creep into table sources. It is worth being cautious. Therefore, use the EXISTS variant, or one of the others, or always remember to include a WHERE clause in the IN condition to eliminate the NULLs.
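The last of those options can be sketched as follows. The @common table variable and its four sample words are hypothetical stand-ins for the article's tables, so that the example runs on its own.

```sql
-- Hypothetical table variables standing in for the article's tables
DECLARE @common TABLE (word NVARCHAR(35) NOT NULL);
DECLARE @someWord TABLE (word NVARCHAR(35) NULL);
INSERT INTO @common (word)
VALUES (N'zydeco'), (N'zygote'), (N'zygotes'), (N'the');
INSERT INTO @someWord (word)
VALUES (N'zydeco'), (N'flob'), (NULL);

-- Unfiltered NOT IN: the NULL stops any row from qualifying, so the count is 0
SELECT [NOT IN with NULL] = Count(*)
  FROM @common AS MyWords
  WHERE MyWords.word NOT IN (SELECT word FROM @someWord);

-- Eliminating the NULLs inside the subquery gives the expected count of 3
SELECT [NOT IN, NULLs filtered] = Count(*)
  FROM @common AS MyWords
  WHERE MyWords.word NOT IN
    (SELECT word FROM @someWord WHERE word IS NOT NULL);
```

The filter costs one extra predicate, but it makes the intent explicit and protects the query if the column later becomes nullable.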


Origin blog.csdn.net/RoffeyYang/article/details/112268093