HQL query causing insufficient disk space exception

XtremeBaumer :

I have following SQL query, which I translated to HQL:

SELECT f.date,
    f.name,
    SUM(f.seats) 
FROM Foo f 
WHERE EXISTS (  SELECT 1 
                FROM Foo fh 
                WHERE f.start + f.end IN (  SELECT fl.start + fl.end 
                                                            FROM Foo fl 
                                                            WHERE fl.date BETWEEN dateadd(yy,-1,fh.date) 
                                                                AND fh.date 
                                                                AND fl.name = '<name>') 
                    AND f.date = fh.date 
                    AND fh.date >= '2016-01-01'
                    AND fh.name = '<name>' ) 
    AND f.date >= '2016-01-01'
GROUP BY f.date,
    f.name 
ORDER BY f.date ASC,
    SUM(f.seats) DESC

In my application this query causes the error in the title:

Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Could not allocate a new page for database 'TEMPDB' because of insufficient disk space in filegroup 'DEFAULT'. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
    at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:216)
    at com.microsoft.sqlserver.jdbc.SQLServerResultSet$FetchBuffer.nextRow(SQLServerResultSet.java:4853)
    at com.microsoft.sqlserver.jdbc.SQLServerResultSet.fetchBufferNext(SQLServerResultSet.java:1781)
    at com.microsoft.sqlserver.jdbc.SQLServerResultSet.next(SQLServerResultSet.java:1034)
    at org.apache.commons.dbcp2.DelegatingResultSet.next(DelegatingResultSet.java:191)
    at org.apache.commons.dbcp2.DelegatingResultSet.next(DelegatingResultSet.java:191)
    at org.hibernate.loader.Loader.processResultSet(Loader.java:986)
    at org.hibernate.loader.Loader.doQuery(Loader.java:948)
    at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:340)
    at org.hibernate.loader.Loader.doList(Loader.java:2689)

This is obviously cause by the sheer inefficiency of the query as well as the number of times its executed as well as the amount of rows it works with.

Lets explain what the query does. Following example:

I have data on uber drivers. Each row is one drive of the driver with date (month), driver name, seats the driver had available, start location and end location.

E. g.:

Date        Name    Seats       Start   End
-------------------------------------------
7/1/2019    John    45          A       B

The data is summed per month. So John had 9 drives between A and B and each time he had 5 seats available. Of course, there are also other people who drive the same route and therefore pose as competition to John.

Date        Name    Seats       Start   End
-------------------------------------------
7/1/2019    John    45          A       B
7/1/2019    Doe     25          A       A
7/1/2019    Alice   35          A       C
7/1/2019    John    30          A       A
7/1/2019    Doe     25          A       C
7/1/2019    Alice   10          A       B
7/1/2019    Doe     5           A       B
7/1/2019    Alice   15          A       A

So for 7/1/2019 Johns "network" (all routes) had this competition:

Date        Name    Seats   Route
---------------------------------
7/1/2019    John    30      A-A
7/1/2019    Doe     25      A-A
7/1/2019    Alice   15      A-A

7/1/2019    John    45      A-B
7/1/2019    Doe     5       A-B
7/1/2019    Alice   10      A-B

As you can see, in this result, the route A-C is not listed, because John didn't drive it at all. If we expand the example data by a new month 8/1/2019:

Date        Name    Seats       Start   End
-------------------------------------------
8/1/2019    John    65          A       C
8/1/2019    Doe     25          A       A
8/1/2019    Alice   35          A       A
8/1/2019    Doe     25          A       B
8/1/2019    Alice   10          A       B
8/1/2019    Doe     5           A       C
8/1/2019    Alice   15          A       C

we can see that John only drove A-C this month. As the network should be build over a timespan of 1 year into the past (8/1/2018 to 8/1/2019), Johns network is now all three routes (A-A, A-B, A-C), but only for calculating the competitors as of 8/1/2019. For 7/1/2019, Johns network stays A-A, A-B. So the result for 8/1/2019 is this:

Date        Name    Seats   Route
---------------------------------
8/1/2019    John    0       A-A
8/1/2019    Doe     25      A-A
8/1/2019    Alice   35      A-A

8/1/2019    John    0       A-B
8/1/2019    Doe     25      A-B
8/1/2019    Alice   10      A-B

8/1/2019    John    65      A-C
8/1/2019    Doe     5       A-C
8/1/2019    Alice   10      A-C

John only drove A-C, which is the reason, why he is counted with 0 seats for the other routes.

As the results are summing the seats and ignore the route, the actual output of the query is as following:

7/1/2019    John    75          <-- 30+45
7/1/2019    Doe     30          <-- 25+5
7/1/2019    Alice   25          <-- 10+15

8/1/2019    John    65          <-- 65+0+0
8/1/2019    Doe     55          <-- 25+25+5
8/1/2019    Alice   55          <-- 35+10+10

In this result we have as routes for 7/1/2019 only A-A and A-B for Johns competitors, as there is no data before that date. For 8/1/2019 Johns network is A-A, A-B and A-C, even though he only drove A-C in 8/1/2019 (A-A and A-B were in 7/1/2019).

I hope the data I provided is understandable. If you need more clarification, just ask and I will try to explain even more.

How do I need to change my query to greatly increase the performance?

I have not used JOINs so far, because I had to join on a subquery, and that is not allowed in HQL.

_{If you need more info/clarification, feel free to ask!}

EDIT:

I know that I could have also posted on codereview.stackexchange.com, but I chose against it, because the query itself works, if only performed for 1 name and only fails for more. My understanding of codereview.stackexchange.com is, that there should only be performance improvement questions

XtremeBaumer :

I came up with this query shortly after posting the question:

SELECT f.date,
    f.name,
    SUM(f.seats) 
FROM Foo f 
WHERE f.start + f.end IN (  SELECT fh.start + fh.end 
                            FROM Foo fh 
                            WHERE fh.date BETWEEN DATEADD(yy, -1, f.date) 
                                AND f.date 
                                AND fh.name = '<name>') 
    AND f.date >= '2016-01-01' 
GROUP BY f.date,
    f.name 
ORDER BY f.date ASC,
    SUM(f.seats) DESC

As you can see, I pretty much just removed the WHERE EXISTS clause. I am not quite certain if that was the right thing to do, or what errors this could cause, but it at least solved the error at hand (insufficient disk space exception).

If you have another idea or remarks on my query, feel free to share them!

HQL query causing insufficient disk space exception

Guess you like