SQL for Deleting Records with No Related Records

Filed under SQL

Had another interesting SQL challenge come up today. Basically, I had two tables. I needed to delete all the records from Table A that did not have any related records in Table B.

Simple enough. The existing code was like so:

delete x
from #tmp x
left join #tmp as y on x.ID=y.ID and y.Status<>'dismissed'
where x.Status='dismissed'
    and y.ID is null

Now, in this case, Table A and Table B happen to be the same table, just interrelated, but I had similar situations with distinct tables.

Since the table in question was a temporary table, there weren’t any indexes defined, so that would account for some of the slowdown, but this query was taking forever to finish.

Just looking at the query, though, it struck me that the LEFT JOIN was likely doing way more work that necessary, which can be confirmed via a quick check of the execution plan.

The problem is that the join has to go ahead and connect all the related records that are actually related, even though, in this case, those are the very records we don’t care anything about (notice the y.Person is null clause).

I’m guessing that the optimizer can’t or doesn’t realize that we don’t actually care about those rows in this case. Not surprising, but definitely interesting.

So, a quick rewrite is in order. I chose a correlated subquery:

delete x
from #tmp x
where x.Status='dismissed'
    and not exists(
        select top 1 1 from #tmp y where y.ID = x.ID and y.Status<>'dismissed'
    )

Believe it or not, this version executed in under a second against the exact same million+ count rowset. The difference was surprising, and I’m not often surprised anymore by SQL optimization tricks.

Ok, a few things to note:

  • The table still doesn’t have any indexes. Things will likely be even faster with some good indexes on the table. But this is a temp table, so you’ll have to weigh the cost of creating the index vs the savings you get for additional queries that might use it while the table exists.
  • Since I don’t actually care about any of the related records, I select the scalar value “1” for the result of the subquery. Technically, this is not actually necessary since the NOT EXISTS clause is used, which usually causes the SQL optimizer to automatically forgo returning any actual data. I just like to be explicit about things.
  • Further, since I only want to know whether any related records exist or not, I can select only the TOP 1 related record, and get it or nothing. This allows some additional SQL optimizations that center around use of the TOP clause, which is a good thing.
  • And finally, use of the NOT EXISTS clause allows even more internal SQL optimizations.

Bottom line

Definitely keep correlated subqueries in your SQL toolbelt. This one example shaved more than 20 minutes off the execution of a particular stored proc.

Are correlated subqueries always the answer? Oh, hell no. Often, they’re the worst approach.

Are there other ways to do this? Certainly. A CROSS JOIN is the first thing that comes to mind. Establishing indexes on the temp table would certainly help as well.

But as fast as this solution is, I didn’t take things any farther.

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*