Collapsing Date Ranges in T-SQL

Filed under Code Garage, SQL

imageI’ve been working on contract for a month or so now, helping to speed up some back end database summarization activity that had gotten slow enough that it was threatening to bleed into the next day’s time frame. Yuck!

Mostly standard stuff, tweaking indexes, ditching cursors, etc.

But one problem had me scratching my head today.

Essentially, I had a table of Clients, each one of which could be linked to any number of “Exposure” records, each of those having a start and stop date of exposure.

The trick was to determine how many total years of exposure a client had.

The thing is, each client might have multiple exposure records with overlapping (or not) time frames. So essentially, the problem boiled down to collapsing all the exposures to a single sequential list of non-overlapping exposure timeframes. From there, it’s trivial to just add up the differences of the date for each time frame.

But how to get there?

Cursors

The existing code was working fine, but took upwards of 40+ minutes. Essentially, it worked via cursors and functions (with more cursors) to collect all the years of all the timeframes for each client, convert them to a list of singular year elements, then convert that to a recordset and finally count up the entries. Workable, but terribly slow.

Skinning the Cat

I’d done something similar ages ago for a medical billing system, so I knew this kind of manipulation could be fast. But I’d long since forgotten exactly how I’d done it.

However, a few Google searches and I landed on Peter Larsson’s blog post about collapsing date ranges using what he calls the “Clustered Index Update”. It’s 3 years old, but definitely something worth throwing in your bag of SQL tricks!

First, create some test data:

create table #test(
   id int,
   seq int,
   d1 datetime,
   d2 datetime)

insert into #test
select 1, null, '2005', '2006' union all
select 1, null,'2007', '2009' union all
select 2, null,'2001', '2006' union all
select 2, null,'2003', '2008' UNION ALL
SELECT    3, null,'2004', '2007' UNION ALL
SELECT    3, null,'2005', '2006' UNION ALL
SELECT    3, null,'2001', '2003' UNION ALL
SELECT    3, null,'2002', '2005' UNION ALL
SELECT    4, null,'2001', '2003' UNION ALL
SELECT    4, null,'2005', '2009' UNION ALL
SELECT    4, null,'2001', '2006' UNION ALL
SELECT    4, null,'2003', '2008'

Next, make sure you have a clustered index across the ID and both Date fields:

CREATE CLUSTERED INDEX ix_id ON #test (ID, d1, d2) with fillfactor = 95

Be sure that the SEQ field is initialized to NULL or 0 (already done via the population code above).

Then, create several variables to assist with counting through the records to set the SEQ field. Use a SELECT to initialize those variables:

DECLARE    
    @id INT,
    @Seq INT,
    @d1 DATETIME,
    @d2 DATETIME
SELECT TOP 1
    @Seq = 0,
    @id = id,
    @d1 = d1,
    @d2 = d2
FROM #test
ORDER BY id, d1

The Trick

Finally, update the SEQ column using the “Clustered Index Update” trick:

UPDATE #test
SET   
    @Seq = CASE
        WHEN d1 > @d2 THEN @Seq + 1
        WHEN id > @id THEN @Seq + 1
        ELSE @Seq
        END,
    @d1 = CASE
        WHEN d2 > @d2 THEN d1
        WHEN id > @id THEN d1
        ELSE @d1
        END,
    @d2 = CASE
        WHEN d2 > @d2 THEN d2
        WHEN id > @id THEN d2
        ELSE @d2
        END,
    Seq = @Seq,
    @id = id

Essentially, what’s happening here is that since the update doesn’t specify an order, SQL will update via the physical order in the database, which is the same as the clustered index (a clustered index determines the physical ordering of records in the table). And since the records are ordered in ID, D1, D2 order, the SEQ column will be updated with an incrementing number that effectively clusters overlapping ranges together.

Since the records are already physically in that order, this update happens lightning fast because there’s no need to perform any index lookups.

You can see the end result by selecting all the records at this point:

select * from #test

Now, the data is ready, you just have to query it using that SEQ column. For instance, this SELECT will retrieve the start and end date of each non-overlapping cluster of dates belonging to each ID.

SELECT      
    ID,
    MIN(d1) AS d1,
    MAX(d2) AS d2
FROM #test
GROUP BY id, Seq
ORDER BY Seq

Mr. Larsson also describes a query to retrieve the “gaps” (or missing date ranges), which could be handy for a different class of problem.

If, like me, you also need a grand total of the number of years, first, you can get the years in each collapsed timeframe and then get the grand total years, like this:

select 
    ID,
    Years = Sum(Years)
From (
    SELECT     
        ID,
        Years=Year(MAX(d2)) - Year(Min(d1)) + 1
    FROM #test
    GROUP BY id, Seq
    ) a
Group by id
Order by id    

Using this trick took this particular query (actually a sizable set of queries and cursor loops) from 40+ minutes to under a minute, with virtually all of that minute being spent filtering down the set of records that I needed to specifically collapse the timeframes on (in other words, doing stuff unrelated to actually collapsing the date ranges). In all, several million records being processed in a few seconds now.

Good stuff.

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*