SQL文摘：Writing Subqueries in SQL

原文地址：https://community.modeanalytics.com/sql/tutorial/sql-subqueries/

Subquery basics

Subqueries (also known as inner queries or nested queries) are a tool for performing operations in multiple steps. For example, if you wanted to take the sums of several columns, then average all of those values, you’d need to do each aggregation in a distinct step.

Subqueries can be used in several places within a query, but it’s easiest to start with the FROM statement. Here’s an example of a basic subquery:

SELECT sub.*
  FROM (
        SELECT *
          FROM tutorial.sf_crime_incidents_2014_01
         WHERE day_of_week = 'Friday' ) sub WHERE sub.resolution = 'NONE'

Let’s break down what happens when you run the above query:

First, the database runs the “inner query”—the part between the parentheses:

SELECT *
  FROM tutorial.sf_crime_incidents_2014_01
 WHERE day_of_week = 'Friday'

If you were to run this on its own, it would produce a result set like any other query. It might sound like a no-brainer, but it’s important: your inner query must actually run on its own, as the database will treat it as an independent query. Once the inner query runs, the outer query will run using the results from the inner query as its underlying table:

SELECT sub.*
  FROM (
       <<results from inner query go here>> ) sub WHERE sub.resolution = 'NONE'

Subqueries are required to have names, which are added after parentheses the same way you would add an alias to a normal table. In this case, we’ve used the name “sub.”

A quick note on formatting: The important thing to remember when using subqueries is to provide some way to for the reader to easily determine which parts of the query will be executed together. Most people do this by indenting the subquery in some way. The examples in this tutorial are indented quite far—all the way to the parentheses. This isn’t practical if you nest many subqueries, so it’s fairly common to only indent two spaces or so.

Write a query that selects all Warrant Arrests from the tutorial.sf_crime_incidents_2014_01 dataset, then wrap it in an outer query that only displays unresolved incidents.

Try it out See the answer

The above examples, as well as the practice problem don’t really require subqueries—they solve problems that could also be solved by adding multiple conditions to the WHERE clause. These next sections provide examples for which subqueries are the best or only way to solve their respective problems.

Using subqueries to aggregate in multiple stages

What if you wanted to figure out how many incidents get reported on each day of the week? Better yet, what if you wanted to know how many incidents happen, on average, on a Friday in December? In January? There are two steps to this process: counting the number of incidents each day (inner query), then determining the monthly average (outer query):

SELECT LEFT(sub.date, 2) AS cleaned_month,
       sub.day_of_week,
       AVG(sub.incidents) AS average_incidents FROM ( SELECT day_of_week, date, COUNT(incidnt_num) AS incidents FROM tutorial.sf_crime_incidents_2014_01 GROUP BY 1,2 ) sub GROUP BY 1,2 ORDER BY 1,2

If you’re having trouble figuring out what’s happening, try running the inner query individually to get a sense of what its results look like. In general, it’s easiest to write inner queries first and revise them until the results make sense to you, then to move on to the outer query.

Write a query that displays the average number of monthly incidents for each category. Hint: use tutorial.sf_crime_incidents_cleandate to make your life a little easier.

Try it out See the answer

Subqueries in conditional logic

You can use subqueries in conditional logic (in conjunction with WHERE, JOIN/ON, or CASE). The following query returns all of the entries from the earliest date in the dataset (theoretically—the poor formatting of the date column actually makes it return the value that sorts first alphabetically):

SELECT *
  FROM tutorial.sf_crime_incidents_2014_01
 WHERE Date = (SELECT MIN(date) FROM tutorial.sf_crime_incidents_2014_01 )

The above query works because the result of the subquery is only one cell. Most conditional logic will work with subqueries containing one-cell results. However, INis the only type of conditional logic that will work when the inner query contains multiple results:

SELECT *
  FROM tutorial.sf_crime_incidents_2014_01
 WHERE Date IN (SELECT date FROM tutorial.sf_crime_incidents_2014_01 ORDER BY date LIMIT 5 )

Note that you should not include an alias when you write a subquery in a conditional statement. This is because the subquery is treated as an individual value (or set of values in the IN case) rather than as a table.

Joining subqueries

You may remember that you can filter queries in joins. It’s fairly common to join a subquery that hits the same table as the outer query rather than filtering in the WHEREclause. The following query produces the same results as the previous example:

SELECT *
  FROM tutorial.sf_crime_incidents_2014_01 incidents
  JOIN ( SELECT date FROM tutorial.sf_crime_incidents_2014_01 ORDER BY date LIMIT 5 ) sub ON incidents.date = sub.date

This can be particularly useful when combined with aggregations. When you join, the requirements for your subquery output aren’t as stringent as when you use the WHEREclause. For example, your inner query can output multiple results. The following query ranks all of the results according to how many incidents were reported in a given day. It does this by aggregating the total number of incidents each day in the inner query, then using those values to sort the outer query:

SELECT incidents.*,
       sub.incidents AS incidents_that_day
  FROM tutorial.sf_crime_incidents_2014_01 incidents
  JOIN ( SELECT date, COUNT(incidnt_num) AS incidents FROM tutorial.sf_crime_incidents_2014_01 GROUP BY 1 ) sub ON incidents.date = sub.date ORDER BY sub.incidents DESC, time

Write a query that displays all rows from the three categories with the fewest incidents reported.

Try it out See the answer

Subqueries can be very helpful in improving the performance of your queries. Let’s revisit the Crunchbase Data briefly. Imagine you’d like to aggregate all of the companies receiving investment and companies acquired each month. You could do that without subqueries if you wanted to, but don’t actually run this as it will take minutes to return:

SELECT COALESCE(acquisitions.acquired_month, investments.funded_month) AS month,
       COUNT(DISTINCT acquisitions.company_permalink) AS companies_acquired, COUNT(DISTINCT investments.company_permalink) AS investments FROM tutorial.crunchbase_acquisitions acquisitions FULL JOIN tutorial.crunchbase_investments investments ON acquisitions.acquired_month = investments.funded_month GROUP BY 1

Note that in order to do this properly, you must join on date fields, which causes a massive “data explosion.” Basically, what happens is that you’re joining every row in a given month from one table onto every month in a given row on the other table, so the number of rows returned is incredibly great. Because of this multiplicative effect, you must use COUNT(DISTINCT) instead of COUNT to get accurate counts. You can see this below:

The following query shows 7,414 rows:

SELECT COUNT(*) FROM tutorial.crunchbase_acquisitions

The following query shows 83,893 rows:

SELECT COUNT(*) FROM tutorial.crunchbase_investments

The following query shows 6,237,396 rows:

    SELECT COUNT(*)
      FROM tutorial.crunchbase_acquisitions acquisitions
      FULL JOIN tutorial.crunchbase_investments investments ON acquisitions.acquired_month = investments.funded_month

If you’d like to understand this a little better, you can do some extra research on cartesian products. It’s also worth noting that the FULL JOIN and COUNT above actually runs pretty fast—it’s the COUNT(DISTINCT) that takes forever. More on that in the lesson on optimizing queries.

Of course, you could solve this much more efficiently by aggregating the two tables separately, then joining them together so that the counts are performed across far smaller datasets:

SELECT COALESCE(acquisitions.month, investments.month) AS month,
       acquisitions.companies_acquired,
       investments.companies_rec_investment
  FROM ( SELECT acquired_month AS month, COUNT(DISTINCT company_permalink) AS companies_acquired FROM tutorial.crunchbase_acquisitions GROUP BY 1 ) acquisitions FULL JOIN ( SELECT funded_month AS month, COUNT(DISTINCT company_permalink) AS companies_rec_investment FROM tutorial.crunchbase_investments GROUP BY 1 )investments ON acquisitions.month = investments.month ORDER BY 1 DESC

Note: We used a FULL JOIN above just in case one table had observations in a month that the other table didn’t. We also used COALESCE to display months when the acquisitions subquery didn’t have month entries (presumably no acquisitions occurred in those months). We strongly encourage you to re-run the query without some of these elements to better understand how they work. You can also run each of the subqueries independently to get a better understanding of them as well.

Write a query that counts the number of companies founded and acquired by quarter starting in Q1 2012. Create the aggregations in two separate queries, then join them.

Try it out See the answer

Subqueries and UNIONs

For this next section, we will borrow directly from the lesson on UNIONs—again using the Crunchbase data:

SELECT *
  FROM tutorial.crunchbase_investments_part1

 UNION ALL

 SELECT *
   FROM tutorial.crunchbase_investments_part2

It’s certainly not uncommon for a dataset to come split into several parts, especially if the data passed through Excel at any point (Excel can only handle ~1M rows per spreadsheet). The two tables used above can be thought of as different parts of the same dataset—what you’d almost certainly like to do is perform operations on the entire combined dataset rather than on the individual parts. You can do this by using a subquery:

SELECT COUNT(*) AS total_rows
  FROM (
        SELECT * FROM tutorial.crunchbase_investments_part1 UNION ALL SELECT * FROM tutorial.crunchbase_investments_part2 ) sub