By Riley Major, 2018-09-14
So you’ve wisely decided to put your business logic in your database, but for whatever reason you can’t use inline user defined functions, even though they encapsulate logic without hurting performance. That’s ok, because you can build functions on the fly in a single statement.
Sort of.
Tables on the Fly
You might have heard of Common Table Expressions (CTEs), T-SQL’s tool for building tables on the fly in the single statement. They’re a lot like Derived Tables, in that they’re SELECT
statements which can be used by other portions of an encapsulating SELECT
statement. But unlike Derived Tables, they can be used multiple times (in the same statement). It’s not accurate to think of them as statement-level temporary tables, because they aren’t filled up with data as a discrete step. Except for their recursive call feature, they are tantamount to repeating the contained SELECT
statement multiple times in the main SELECT
statement.
For example, this SELECT
statement repeats the same Derived Table multiple times.
SELECT a1 = t1.a, a2 = t2.a FROM ( SELECT a = n + 1 FROM @t ) AS t1 CROSS JOIN ( SELECT a = n + 1 FROM @t ) AS t2;
This alternative SELECT
statement creates a Common Table Expression for that Derived Table’s contents and references the CTE twice.
WITH t AS ( SELECT a = n + 1 FROM @t ) SELECT a1 = t1.a, a2 = t2.a FROM t AS t1 CROSS JOIN t AS t2;
They have the same query plan. And there is no separate step in the query plan for the “population” of the supposed temporary table.
So if you’re going to repeatedly use the same *data* multiple times in a query, the Common Table Expression is your tool of choice. But what if you want to use the same logic?
Pump it up.
Unfortunately, there’s no similar syntactical sugar to create a statement-level user-defined function. There’s no direct way to have the same calculation applied to different columns of data without creating an actual user-defined function or simply repeating the logic. But there is a neat trick which in the right circumstances can accomplish the same goal without costing you any performance.
The goal, as with any function, is not repeating yourself. It’s not just a bunch of extra characters in your code. It’s an accident waiting to happen– two representations of the same logic which are just bound to diverge due to our cursed human imperfections.
But how can you run different columns of data through the same calculation in a query? Turn them into rows!
Calculations are already performed repeatedly for every row in a query. So all you have to do is take your columns and explode them up into multiple rows. In order to apply a calculation to Column1
and Column2
, first change each row of the data into two more rows with a new, computed Column3
which has the original row’s data for Column1
first and the data for Column2
second.
In other words, you need to convert this:
Column1 | Column2 |
---|---|
ABC | DEF |
GHI | JKL |
Into this:
Column1 | Column2 | Counter | Column3 |
---|---|---|---|
ABC | DEF | 1 | ABC |
ABC | DEF | 2 | DEF |
GHI | JKL | 1 | GHI |
GHI | JKL | 2 | JKL |
Once you have that vertical list of the columns’ values, you can APPLY
your calculation to it, once, and it will operate on all of the values.
Column1 | Column2 | Counter | Column3 – Altered |
---|---|---|---|
ABC | DEF | 1 | CBA |
ABC | DEF | 2 | FED |
GHI | JKL | 1 | IHG |
GHI | JKL | 2 | LKJ |
All that’s left is to smoosh those vertically-represented values back into a single horizontal row. You could use PIVOT
, but that syntax is a little tricky. Instead, first duplicate the calculated column and then selectively suppress the values you don’t want (by using a CASE
expession). You’ll end up with this:
Column1 | Column2 | Counter | Column1 – Altered | Column 2 – Altered |
---|---|---|---|---|
ABC | DEF | 1 | CBA | |
ABC | DEF | 2 | FED | |
GHI | JKL | 1 | IHG | |
GHI | JKL | 2 | LKJ |
If you take the MAX
of each of those calculated columns, you end up with a single row for each one, with your final results:
Column1 | Column2 | Column1 – Altered | Column 2 – Altered |
---|---|---|---|
ABC | DEF | CBA | FED |
GHI | JKL | IHG | LKJ |
And voila! You’ve taken a statement with repetitive code:
SELECT ContactName = c.ContactName, Phone1Formatted = REPLACE(REPLACE(REPLACE(REPLACE(c.Phone1,'(',''),')',''),' ',''),'-',''), Phone2Formatted = REPLACE(REPLACE(REPLACE(REPLACE(c.Phone2,'(',''),')',''),' ',''),'-','') FROM @Contacts AS c;
And eliminated that repetition:
SELECT ContactName = c.ContactName, Phone1Formatted = t.Phone1Formatted, Phone2Formatted = t.Phone2Formatted FROM @Contacts AS c CROSS APPLY ( SELECT Phone1Formatted = MAX(CASE WHEN Exploder.n = 1 THEN PhoneFormatter.PhoneFormatted ELSE '' END), Phone2Formatted = MAX(CASE WHEN Exploder.n = 2 THEN PhoneFormatter.PhoneFormatted ELSE '' END) FROM (SELECT n = 1 UNION ALL SELECT 2) AS Exploder CROSS APPLY ( SELECT Phone = CASE Exploder.n WHEN 1 THEN c.Phone1 WHEN 2 THEN c.Phone2 ELSE NULL END ) AS Chooser CROSS APPLY ( SELECT PhoneFormatted = REPLACE(REPLACE(REPLACE(REPLACE(Chooser.Phone,'(',''),')',''),' ',''),'-','') ) AS PhoneFormatter ) AS t;
Admittedly, there is more overall code in this simplified example, but imagine that the logic was more complex. And also, remember that it’s not about character count, but rather removing duplication because of the potential for different behaviors if the copied code isn’t perfectly maintained.
Unfortunately, unlike with Common Table Expressions, this is not mere syntactic sugar. This is semantically different T-SQL which makes the engine do more work– first, spreading apart the columns into rows before shoving them through your calculation and then squishing them back down again. But in the right circumstances, it might add only trivial overhead, as in this example.
Still, use this technique with care. It’s almost always better to just make actual inline user-defined functions, which often have identical performance with the benefit of code encapsulation, as they do in this example.
Ceci est une pipe
This isn’t the only way to reduce code duplication in your complex T-SQL statements. If you have two calculations which have similar beginnings but differ in the final steps, break them apart into their separate steps so that you can re-use the first steps of the calculation.
SELECT Palindrome1 = SecondaryOperations.Palindrome1, Palindrome2 = SecondaryOperations.Palindrome2 FROM (SELECT d = 'abc') AS RawData CROSS APPLY ( SELECT r = REVERSE(RawData.d) ) AS BaseOperation CROSS APPLY ( SELECT Palindrome1 = RawData.d + BaseOperation.r, Palindrome2 = BaseOperation.r + RawData.d ) AS SecondaryOperations;
It’s helpful to think of each APPLY
as a pipe operation, taking the values from the previous derived table and passing them into the next to be manipulated. Programming T-SQL in this manner (loosely) approximates modern functional programming techniques.
It keeps each step of the logic smaller, so that it’s easier to understand. And you can expose the intermediary columns to help with debugging.
Department of Redundancy Department
You are going to have business logic in your database, whether you like it or not. Use these and other techniques to make sure it’s only in there once.
References
Photo credit: Water Bird by Evan