[SQL Must-Know] Lesson 11: Using Subqueries

Table of contents

subquery

filter with subquery

Format SQL

can only be a single column

Subqueries and performance

Using a subquery as a calculated field

Note: fully qualified column names

Hint: more than one solution

subquery

SELECT statements are SQL queries. All the SELECT statements we've seen so far have been simple queries, that is, single statements that retrieve data from a single database table.

SQL also allows the creation of subqueries, queries that are nested within other queries.

filter with subquery

SELECT cust_id
FROM Orders
WHERE order_num IN (SELECT order_num
                    FROM OrderItems
                    WHERE prod_id = 'RGAN01');

In a SELECT statement, subqueries are always processed from the inside out. In processing the above SELECT statement, the DBMS actually performs two operations.

First, it executes the following query:

SELECT order_num FROM OrderItems WHERE prod_id = 'RGAN01';

This query returns two order numbers: 20007 and 20008. These two values are then passed to the WHERE clause of the outer query in the comma-separated format required by the IN operator. The outer query becomes:

SELECT cust_id FROM Orders WHERE order_num IN (20007, 20008);
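The two steps described above can be reproduced by hand. The following is a minimal sketch using Python's built-in sqlite3 module; the table contents are made-up sample rows (an assumption, chosen only so that the inner query returns 20007 and 20008 as in the text), and the order_item column is likewise assumed.

```python
import sqlite3

# Made-up sample data (an assumption), just enough to mirror the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (order_num INTEGER PRIMARY KEY, cust_id TEXT);
    CREATE TABLE OrderItems (order_num INTEGER, order_item INTEGER, prod_id TEXT);
    INSERT INTO Orders VALUES
        (20005, '1000000001'), (20007, '1000000004'), (20008, '1000000005');
    INSERT INTO OrderItems VALUES
        (20005, 1, 'BR01'), (20007, 1, 'BR03'),
        (20007, 5, 'RGAN01'), (20008, 1, 'RGAN01');
""")

# Step 1: run the inner query on its own.
inner = [row[0] for row in conn.execute(
    "SELECT order_num FROM OrderItems WHERE prod_id = 'RGAN01' ORDER BY order_num")]

# Step 2: feed those values to the outer query as an IN list.
placeholders = ", ".join("?" for _ in inner)
two_step = conn.execute(
    f"SELECT cust_id FROM Orders WHERE order_num IN ({placeholders}) ORDER BY cust_id",
    inner).fetchall()

# The nested form from the text, run as a single statement.
nested = conn.execute("""
    SELECT cust_id
    FROM Orders
    WHERE order_num IN (SELECT order_num
                        FROM OrderItems
                        WHERE prod_id = 'RGAN01')
    ORDER BY cust_id
""").fetchall()

print(inner)               # order numbers found by the inner query
print(two_step == nested)  # both approaches return the same customers
```

Running the steps separately like this is also a handy way to debug a subquery: check the inner result first, then the outer query.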

Format SQL

SELECT statements containing subqueries can be difficult to read and debug, especially as they grow more complex. As shown above, breaking subqueries across multiple lines and indenting them appropriately makes them much easier to work with.

Incidentally, this is also where color coding helps: good DBMS clients color-code SQL for exactly this reason.

As this shows, subqueries in a WHERE clause let you write powerful and flexible SQL statements. SQL places no limit on how many subqueries can be nested, but in practice performance suffers as nesting deepens, so avoid nesting too many.
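To illustrate nesting more than one level deep, here is a sketch that wraps the earlier query in one more layer to look up customer names. The Customers table, its cust_name column, and all sample rows are assumptions made for this example.

```python
import sqlite3

# Made-up sample data (an assumption) for a three-level nested query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (cust_id TEXT PRIMARY KEY, cust_name TEXT);
    CREATE TABLE Orders (order_num INTEGER PRIMARY KEY, cust_id TEXT);
    CREATE TABLE OrderItems (order_num INTEGER, order_item INTEGER, prod_id TEXT);
    INSERT INTO Customers VALUES
        ('1000000001', 'Village Toys'),
        ('1000000004', 'Fun4All'),
        ('1000000005', 'The Toy Store');
    INSERT INTO Orders VALUES
        (20005, '1000000001'), (20007, '1000000004'), (20008, '1000000005');
    INSERT INTO OrderItems VALUES
        (20005, 1, 'BR01'), (20007, 5, 'RGAN01'), (20008, 1, 'RGAN01');
""")

# Innermost query: orders containing RGAN01.
# Middle query: customers who placed those orders.
# Outer query: the names of those customers.
customers = conn.execute("""
    SELECT cust_name
    FROM Customers
    WHERE cust_id IN (SELECT cust_id
                      FROM Orders
                      WHERE order_num IN (SELECT order_num
                                          FROM OrderItems
                                          WHERE prod_id = 'RGAN01'))
    ORDER BY cust_name
""").fetchall()

print(customers)
```

The indentation carries most of the readability here: each level of nesting sits one step further to the right.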

can only be a single column

A SELECT statement used as a subquery in this way (as the list for an IN condition) can retrieve only a single column. Attempting to retrieve multiple columns returns an error.

SELECT cust_id
FROM Orders
WHERE order_num IN (SELECT order_num, order_id
                    FROM OrderItems
                    WHERE prod_id = 'RGAN01');

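The statement above raises an error: the subquery returns two columns, but the outer query's condition compares against only one, and the mismatch is rejected. (The order_id column is assumed here purely for illustration.) If you want to filter on two columns, give each its own subquery, as follows. Note that the two conditions are checked independently, so the matching values need not come from the same OrderItems row: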

SELECT cust_id
FROM Orders
WHERE order_num IN (SELECT order_num
                    FROM OrderItems
                    WHERE prod_id = 'RGAN01')
AND   order_id IN (SELECT order_id
                    FROM OrderItems
                    WHERE prod_id = 'RGAN01');
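The single-column rule is easy to check for yourself. The sketch below uses SQLite through Python's sqlite3 module, with made-up sample rows and an assumed order_item column standing in for the second column. The exact error text differs from one DBMS to another, so the example only records that an error occurred.

```python
import sqlite3

# Made-up sample data (an assumption); order_item plays the role of the
# extra column in the failing subquery.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (order_num INTEGER PRIMARY KEY, cust_id TEXT);
    CREATE TABLE OrderItems (order_num INTEGER, order_item INTEGER, prod_id TEXT);
    INSERT INTO Orders VALUES
        (20005, '1000000001'), (20007, '1000000004'), (20008, '1000000005');
    INSERT INTO OrderItems VALUES
        (20005, 1, 'BR01'), (20007, 5, 'RGAN01'), (20008, 1, 'RGAN01');
""")

# A one-column subquery works as expected.
single_column = conn.execute("""
    SELECT cust_id
    FROM Orders
    WHERE order_num IN (SELECT order_num
                        FROM OrderItems
                        WHERE prod_id = 'RGAN01')
    ORDER BY cust_id
""").fetchall()

# A two-column subquery compared against one column is rejected.
try:
    conn.execute("""
        SELECT cust_id
        FROM Orders
        WHERE order_num IN (SELECT order_num, order_item
                            FROM OrderItems
                            WHERE prod_id = 'RGAN01')
    """).fetchall()
    error_message = None
except sqlite3.Error as exc:
    error_message = str(exc)

print(single_column)
print(error_message)  # the DBMS's complaint about the column count
```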

Subqueries and performance

The code given here works and achieves the desired result. However, using subqueries is not always the most efficient way to perform this type of data retrieval. Lesson 12 returns to this example and discusses it further.

Using a subquery as a calculated field

SELECT cust_name,
       cust_state,
       (SELECT COUNT(*)
        FROM Orders
        WHERE Orders.cust_id = Customers.cust_id) AS orders
FROM Customers
ORDER BY cust_name;

This SELECT statement returns three columns for each customer in the Customers table: cust_name, cust_state, and orders. orders is a computed field that is built from the subquery enclosed in parentheses. This subquery is executed once for each customer retrieved. In this case, the subquery is executed 5 times because 5 customers were retrieved.
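Here is a small runnable sketch of the statement above, using SQLite via Python's sqlite3 module. The five customers and five orders are made-up sample rows (an assumption), so the counts shown are those of this sample only.

```python
import sqlite3

# Made-up sample data (an assumption): five customers, five orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (cust_id TEXT PRIMARY KEY, cust_name TEXT, cust_state TEXT);
    CREATE TABLE Orders (order_num INTEGER PRIMARY KEY, cust_id TEXT);
    INSERT INTO Customers VALUES
        ('1000000001', 'Village Toys', 'MI'),
        ('1000000002', 'Kids Place', 'OH'),
        ('1000000003', 'Fun4All', 'IN'),
        ('1000000004', 'Fun4All', 'AZ'),
        ('1000000005', 'The Toy Store', 'IL');
    INSERT INTO Orders VALUES
        (20005, '1000000001'), (20006, '1000000003'), (20007, '1000000004'),
        (20008, '1000000005'), (20009, '1000000001');
""")

# The subquery is evaluated for each Customers row, counting only the
# orders whose cust_id matches that row.
rows = conn.execute("""
    SELECT cust_name,
           cust_state,
           (SELECT COUNT(*)
            FROM Orders
            WHERE Orders.cust_id = Customers.cust_id) AS orders
    FROM Customers
    ORDER BY cust_name
""").fetchall()

for cust_name, cust_state, orders in rows:
    print(cust_name, cust_state, orders)
```

Note that a customer with no orders still appears in the result, with a count of 0.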

The WHERE clause in the subquery differs slightly from the WHERE clauses used earlier: instead of just the column name (cust_id), it uses fully qualified column names that specify both the table and the column (Orders.cust_id and Customers.cust_id). The following WHERE clause tells SQL to compare the cust_id in the Orders table with the cust_id currently being retrieved from the Customers table:

WHERE Orders.cust_id = Customers.cust_id

The table and column names are separated by a period. This syntax must be used whenever column names could be ambiguous. In this example there are two cust_id columns, one in Customers and one in Orders. Without fully qualified column names, the DBMS would assume that the cust_id in the Orders table is being compared with itself. As a result, the query

SELECT COUNT(*) FROM Orders WHERE cust_id = cust_id

always returns the total number of orders in the Orders table, which is not what we want:

SELECT cust_name,
       cust_state,
       (SELECT COUNT(*)
        FROM Orders
        WHERE cust_id = cust_id) AS orders
FROM Customers
ORDER BY cust_name;
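To see the wrong result concretely, the sketch below runs the unqualified version in SQLite through Python's sqlite3 module. The sample rows are made up (an assumption): five customers and five orders, so every customer is reported with all five orders.

```python
import sqlite3

# Made-up sample data (an assumption): five customers, five orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (cust_id TEXT PRIMARY KEY, cust_name TEXT, cust_state TEXT);
    CREATE TABLE Orders (order_num INTEGER PRIMARY KEY, cust_id TEXT);
    INSERT INTO Customers VALUES
        ('1000000001', 'Village Toys', 'MI'),
        ('1000000002', 'Kids Place', 'OH'),
        ('1000000003', 'Fun4All', 'IN'),
        ('1000000004', 'Fun4All', 'AZ'),
        ('1000000005', 'The Toy Store', 'IL');
    INSERT INTO Orders VALUES
        (20005, '1000000001'), (20006, '1000000003'), (20007, '1000000004'),
        (20008, '1000000005'), (20009, '1000000001');
""")

total_orders = conn.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]

# Both cust_id references resolve to Orders.cust_id, so the condition is
# true for every order and the count is the same for every customer.
unqualified = conn.execute("""
    SELECT cust_name,
           cust_state,
           (SELECT COUNT(*)
            FROM Orders
            WHERE cust_id = cust_id) AS orders
    FROM Customers
    ORDER BY cust_name
""").fetchall()

for cust_name, cust_state, orders in unqualified:
    print(cust_name, cust_state, orders)
```

No error is raised here, which is what makes this mistake easy to miss.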

Although subqueries are extremely useful in constructing such SELECT statements, care must be taken to properly qualify ambiguous column names.

You can also use table aliases to tell the columns apart: for example, alias one table as a and the other as b, then write a.field = b.field.
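As a sketch of the alias form, the example below gives Customers the alias c and Orders the alias o (both names are arbitrary choices for this example) and checks that the result matches the fully qualified version. The sample rows are made up. The AS keyword for table aliases is accepted by SQLite, used here, but some DBMSs such as Oracle expect the alias without AS.

```python
import sqlite3

# Made-up sample data (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (cust_id TEXT PRIMARY KEY, cust_name TEXT, cust_state TEXT);
    CREATE TABLE Orders (order_num INTEGER PRIMARY KEY, cust_id TEXT);
    INSERT INTO Customers VALUES
        ('1000000001', 'Village Toys', 'MI'),
        ('1000000002', 'Kids Place', 'OH'),
        ('1000000005', 'The Toy Store', 'IL');
    INSERT INTO Orders VALUES
        (20005, '1000000001'), (20008, '1000000005'), (20009, '1000000001');
""")

# Fully qualified column names, as in the text.
qualified = conn.execute("""
    SELECT cust_name,
           cust_state,
           (SELECT COUNT(*)
            FROM Orders
            WHERE Orders.cust_id = Customers.cust_id) AS orders
    FROM Customers
    ORDER BY cust_name
""").fetchall()

# The same query with table aliases c and o.
aliased = conn.execute("""
    SELECT c.cust_name,
           c.cust_state,
           (SELECT COUNT(*)
            FROM Orders AS o
            WHERE o.cust_id = c.cust_id) AS orders
    FROM Customers AS c
    ORDER BY c.cust_name
""").fetchall()

print(aliased)
print(aliased == qualified)
```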

Note: fully qualified column names

You have just seen why fully qualified column names matter: without them, the DBMS misinterprets what you mean and returns wrong results. In other cases the DBMS raises an error because conflicting column names are ambiguous, for example when a column named in a WHERE or ORDER BY clause appears in more than one table. Whenever a SELECT statement works with multiple tables, it is good practice to use fully qualified column names to avoid ambiguity.

Hint: more than one solution

As mentioned earlier in this lesson, although the sample code presented here works, it is not always the most efficient way to perform this kind of data retrieval. We'll come across this example again when we study JOINs in the next two lessons.
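As a preview only, here is one way the filtering example from this lesson might be written with a join instead of a subquery. This is a sketch under assumptions: the sample rows are made up, and the INNER JOIN form shown is just one possible rewrite, not necessarily the one the later lessons use. Which form runs faster depends on the DBMS and the data.

```python
import sqlite3

# Made-up sample data (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (order_num INTEGER PRIMARY KEY, cust_id TEXT);
    CREATE TABLE OrderItems (order_num INTEGER, order_item INTEGER, prod_id TEXT);
    INSERT INTO Orders VALUES
        (20005, '1000000001'), (20007, '1000000004'), (20008, '1000000005');
    INSERT INTO OrderItems VALUES
        (20005, 1, 'BR01'), (20007, 5, 'RGAN01'), (20008, 1, 'RGAN01');
""")

# Subquery version from earlier in the lesson.
with_subquery = conn.execute("""
    SELECT cust_id
    FROM Orders
    WHERE order_num IN (SELECT order_num
                        FROM OrderItems
                        WHERE prod_id = 'RGAN01')
    ORDER BY cust_id
""").fetchall()

# A possible join version. A join returns one row per matching OrderItems
# row, so it can produce duplicates that the IN form does not; this sample
# data has none.
with_join = conn.execute("""
    SELECT Orders.cust_id
    FROM Orders
    INNER JOIN OrderItems ON OrderItems.order_num = Orders.order_num
    WHERE OrderItems.prod_id = 'RGAN01'
    ORDER BY Orders.cust_id
""").fetchall()

print(with_subquery)
print(set(with_subquery) == set(with_join))
```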