Fun With SQL (OVER and PARTITION clauses)

Filed under Hiring, SQL

I’ve been interviewing over the last few weeks for a new position (If anyone has a need for a very experienced .Net developer with some recent Ruby/Rails/Javascript/Coffeescript/JQuery etc chops, shoot me an email!)

At any rate, one company had an interesting SQL challenge.

In a nutshell, given a table with an ID field and a FLAG bit field, find all the consecutive rows where the FLAG field is 1 (ie a bad record).

The trick was that the output needed to look like:

ID RANGE        Error Count
4000-6000       2001

First, lets insert a CSV file into a table….

BULK INSERT Flags
FROM 'c:\dev\darin\sqltest\sqltest.csv'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
FIRSTROW = 2
)
GO

They supplied a 1.6 million row sample table, and expected sub 10 second response time. Sooooo, you also might want to add a clustered index to the ID field.

Let’s see some SQL!

With the sample table in place, I could get down to business.

I’ll admit, I had to google the syntax of the OVER and PARTITION clauses in T-SQL. They’re not something I use every day.

First, I needed all the rows, with a group number grouping all the consecutive rows who’s flag value was 1. I used a Common Table Expression for this. I could have used temp tables and it would have resulted in about the same performance, but… what the hey!

WITH parts AS (
   SELECT *, id - ROW_NUMBER() OVER (ORDER BY id) AS grp
   FROM flags
   WHERE flag = 1
)

Next, I needed the count of each group as part of the result set….

WITH counts AS (
   SELECT *, COUNT(*) OVER (PARTITION BY grp) AS cnt
   FROM parts
)

With those two CTEs in place, all that was left was the output Select….

SELECT cast(min(id) as varchar(20)) + '-' + cast(max(id) as varchar(20)) as IDRange, 
   cnt as ErrorCount
FROM counts
WHERE cnt >= 1000
GROUP BY grp, cnt
ORDER BY IDRange

In this case, I only cared about blocks of errors of more than 1000 rows. The concatenated SELECT is just to make a nice output format. realistically, I’d probably only select MIN(ID), MAX(ID), CNT

I haven’t used CTE’s a lot, but I’m finding more and more uses for them.

And the OVER clause is something I really need to get more familiar with.

For the 1.6 million row source table, I was getting sub 2 second response times on this query. No doubt, there’s ways to speed this up, but that’s not too shabby.

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*