Anyway, in processing several test batches of records, I started noticing that jobs would “hang”. I left the process going for hours, and it never released.
Now, normally, I’d go straight for the deadlock detection stuff, but my rules engine (which invokes all the various rules necessary for the processing) already has error handling code set up to deal with deadlocks. The processes are complex and run on multiple threads against the server simultaneously, so the occasional deadlock is just about unavoidable.
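For illustration, deadlock handling of that kind usually amounts to catching error 1205 (deadlock victim) and retrying. The following is only a minimal sketch of the idea, not the actual rules engine: `dbo.RunRule` and `@MaxRetries` are hypothetical names, and it assumes SQL Server 2005 or later for TRY/CATCH.

```sql
-- Sketch only: dbo.RunRule and @MaxRetries are hypothetical names.
declare @MaxRetries int, @Attempt int, @Msg nvarchar(2048);
select @MaxRetries = 3, @Attempt = 1;

while @Attempt <= @MaxRetries
begin
    begin try
        exec dbo.RunRule;          -- hypothetical rule invocation
        break;                     -- success: leave the retry loop
    end try
    begin catch
        -- Roll back any transaction left open by the failure.
        if xact_state() <> 0
            rollback transaction;

        -- 1205 = this session was chosen as the deadlock victim.
        -- Any other error, or running out of attempts, is re-raised.
        if error_number() <> 1205 or @Attempt = @MaxRetries
        begin
            set @Msg = error_message();
            raiserror('%s', 16, 1, @Msg);
            break;
        end

        set @Attempt = @Attempt + 1;   -- deadlock victim: try again
    end catch
end
```

The point is that a true deadlock surfaces as an error the engine can react to, which is why a silent hang had to be something else.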
But, in this case, I wasn’t getting deadlocks.
Alright then, it had to be blocking. But firing up Activity Monitor, plus all the usual stored procs for evaluating blocking, showed nothing. The job was definitely running, though, as reported by the SQL Agent monitor.
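A blocking check of this sort can also be done directly against the dynamic management views. This is a minimal sketch, assuming SQL Server 2005 or later (where these views exist); it is an illustration, not the exact set of procs used here.

```sql
-- Sketch: list requests that are currently blocked or waiting on something.
select r.session_id,
       r.status,
       r.command,
       r.blocking_session_id,   -- 0 or NULL when no session is blocking this one
       r.wait_type,             -- NULL when the request is not waiting
       r.wait_time,             -- milliseconds spent in the current wait
       r.wait_resource
from sys.dm_exec_requests as r
where r.session_id <> @@SPID    -- skip this diagnostic query itself
  and (r.blocking_session_id <> 0 or r.wait_type is not null);
```

A request with no blocker that sits on a parallelism wait such as CXPACKET would point away from ordinary lock blocking.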
Even more puzzling was if I forcibly stopped the errant job, my rules engine would kick in (as it’s supposed to), rerun the failed job, and it’d run right through, no delays at all.
And finally, the real kicker. I could reset the process and run the exact same set of records through it, and sometimes a job would hang, while other times it’d run through perfectly fine.
Ah, random bugs!
After adding some logging, I was able to narrow down the hang to a specific TSQL query within a particular stored proc.
Granted, this is summarized somewhat from the full query, but basically, it looked like this:
    select def.ClientID, def.Defendant, PlayerID=MIN(pl.ID)
    into #tmp
    from dbo.GetDefendants(@IDFilter) def
    inner join player as pl on pl.ID = Def.PlayerID
    group by ClientID, Defendant
The only peculiarity is the use of a UDF (the “GetDefendants” function) to retrieve a filtered list of rows from the “Defendants” table.
The actual tables involved aren’t particularly large (generally 1000 to 3000 rows and 20 or so columns, no image or text columns).
I then loaded up the SQL Trace utility, but also came up short. The trace indicated that the query in question was indeed being invoked, but then, nothing.
At this point things are starting to look grim. I could run the particular query manually from SSMS and it always worked properly, no hangs.
So, I started trying things: breaking the query up, pushing the joined tables through temp tables, etc. I still got the same hang, but still only occasionally.
I finally tried this alternate version:
    select def.ClientID, def.Defendant, PlayerID=pl.ID
    into #tmp
    from dbo.GetDefendants(@IDFilter) def
    inner join player as pl on pl.ID = Def.PlayerID

    select ClientID, Defendant, PlayerID=min(PlayerID)
    into #tmp2
    from #tmp
    group by ClientID, Defendant
Note that the only real difference is that I removed the GROUP BY and the MIN from the original query, pushed ALL the results into a temp table, then applied the GROUP BY and MIN to that temp table, writing the result into a second temp table.
And presto, no more hangs.
This definitely ranks up there with the weirdest SQL issues I’ve ever run into. I’m guessing it’s some combination of things: blocking; running the same queries (against different sets of rows) multiple times simultaneously (using multiple SQL jobs); the fact that the server is an 8 core machine and SQL has parallelism turned on; and the use of GROUP BY. At this point, though, I haven’t been able to say definitively what the problem is.
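One way to test the parallelism part of that guess would be to run the original single-statement query with parallelism disabled for just that statement. This is a sketch, assuming the standard `OPTION (MAXDOP 1)` query hint; whether it actually removes the hang is untested here.

```sql
-- Sketch: the original query, limited to a serial plan by a query hint.
-- @IDFilter is the stored proc's existing parameter.
select def.ClientID, def.Defendant, PlayerID = MIN(pl.ID)
into #tmp
from dbo.GetDefendants(@IDFilter) def
inner join player as pl on pl.ID = def.PlayerID
group by ClientID, Defendant
option (maxdop 1);   -- applies to this statement only, not the whole server
```

If the hang disappears with the hint in place, parallelism is at least part of the story; if it persists, that suspect can be crossed off.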