Monday, April 16, 2007

better SQL WITH subquery factoring

I beg you pardon if I'm not going to write about an Oracle quirk today.
On the contrary, I'll cover an interesting feature of Oracle that appeared for the first time with Oracle 9i and that only recently became one of my favorite lifesavers.

I am referring to the obscure and perhaps neglected subquery factoring feature, that is that strange syntax beginning with WITH.

If you open What's new in the SQL Reference (for Oracle 9i), i believe you can easily overlook this new query syntax flavor as it is not particularly emphasized, in spite of its virtues.

I mean, every time a new feature is introduced either you try them all soon or you just place a sort of mental bookmark on it while thinking hum, this sounds cool, I'll see where i can use it.
Then you completely forgot it.

Otherwise, if the description isn't too fancy, you may end up thinking that you can live perfectly well without it.

I must admit that probably the latter applied to my case as i didn't realize the existence of the subquery factoring clause until i read about it somewhere two years ago. But even at that time, i failed to see its potential, until i recently hit against a big SQL statement made up of a bunch of nested SQL subqueries containing bind variables.

One of the situations i see most frequently when developing with Oracle Application Express, is the need of splitting a list of string values into a recordset. This usually involves calling a function returning a user defined object type in form of an object table.

select column_value
from table(csv_to_table('firenze,roma,venezia'));

column_value
--------------
firenze
roma
venezia

Where csv_to_table is a function i created that splits up a string separated by commas by default.

Now, imagine that a colon separated list of values, returned by a multiselect list for instance, so typical of Oracle Application Express, serves as a parameter for a query, using a bind variable:

select column_value from table(csv_to_table(:user_supplied_list, ':'));

Suppose you need to perform some kind of operation involving each pair of items in the list, excluding the case where both items are the same.

This means creating a Cartesian product between two exact copies of our input list and perform such calculation:

WITH user_defined_values AS
( SELECT COLUMN_VALUE AS my_value
FROM TABLE(csv_to_table(:user_supplied_list, ':'))
)
SELECT my_function(a.my_value, b.my_value) as result
FROM user_defined_values a,
user_defined_values b
WHERE a.my_value != b.my_value;

The main advantage i see here is in that SQL statements become much easier to read.
This can be even more noticeable with more complex queries, like those taking two or three pages of conditions, GROUP BYs, OLAP functions, you know what i mean, those that you love so much, especially when they are written by someone else in a hurry...

So, this sample query can be seen as a sort of template that you can build upon for developing more complex structures while retaining some code readability.

For instance, this simple template could work well in a situation where you need to calculate the reciprocal distance between each pair of cities specified by a user, supposing you have a routing algorithm that can compute the distance between any two locations.

WITH list_of_locations AS
( SELECT COLUMN_VALUE AS location_name
FROM TABLE(csv_to_table(:user_supplied_list, ':'))
)
SELECT calculate_distance_between(a.location_name, b.location_name) as result
FROM list_of_locations a,
list_of_locations b
WHERE a.location_name != b.location_name;

Of course this is a simplification of a real problem, but let's suppose that locations are specified using a syntax like, "town,province,region", just to shake up things a little bit.

So, supposing we want to further refine the query above, by using a numerical primary key instead of the location name, we get:

WITH list_of_locations AS
(
SELECT location_id
FROM locations
WHERE (town,province,region) IN
(
SELECT get_nth_from_csv(COLUMN_VALUE,1) as town,
get_nth_from_csv(COLUMN_VALUE,2) as province,
get_nth_from_csv(COLUMN_VALUE,3) as region
FROM TABLE(csv_to_table(:user_supplied_list, ':'))
)
)
SELECT calculate_distance_between(a.location_id, b.location_id) as result
FROM list_of_locations a,
list_of_locations b
WHERE a.location_id != b.location_id;

Note also that for commutative functions, it's normally useless to compute both fn(A,B) and fn(B,A) as they would return the same result, so you can optimize the non-equijoin to return just a triangular matrix, provided the values can be compared using lesser-than (<) or greater than (>).
WITH list_of_locations AS
(
SELECT location_id
FROM locations
WHERE (town,province,region) IN
(
SELECT get_nth_from_csv(COLUMN_VALUE,1) as town,
get_nth_from_csv(COLUMN_VALUE,2) as province,
get_nth_from_csv(COLUMN_VALUE,3) as region
FROM TABLE(csv_to_table(:user_supplied_list, ':'))
)
)
SELECT calculate_distance_between(a.location_id, b.location_id) as result
FROM list_of_locations a,
list_of_locations b
WHERE a.location_id < b.location_id

Instead of N*(N-1) values, you end up with just N*(N-1)/2 values.

However, this holds only for commutative functions and certainly calculate_distance_between is not the best example of such a function.

I say this because yesterday i was on a trip with my family and we were visiting friends living in a small town near Rome.
We managed to arrive as close as 100 meters from the crossroad where theoretically we'd had to turn left, just to discover that some fanciful city planner recently decided to convert the main road from two-way to one-way.

All right i thought, piece of cake, let's go back and see how we can bypass it.
It was 12.00 PM.

At 12.25 we passed in front of the main town square for the third time in a row and i was yelling insults at the address of the mayor.

Then we decided to take a radical decision and go out of the town and see if our GPS could sort out this mess starting from a different point.
And so it did, by proposing a route of 4.8Km, around the town, a very pleasant journey through lovely hills and green fields, at the end of which we arrived at the destination from the opposite direction.

4.8 km instead of 100 meters, a small journey for a car, but a giant leap for mankind, especially for the heirs of the Romans, world's famous city planners.

When finally we met our friends they were quite surprised to see us right there because usually they have to go out and pick up the lost people along the route!
I never got lost in New York, in London or in Mexico City, but i was almost to the point of giving up the task in Mentana, a town of nearly 16,000 inhabitants or perhaps 20,000 including the sheep.

Moral:
a distance of 100 meters from B to A can easily increase up to 4800 meters when you swap A and B and you are in Mentana.

Happy factoring!

2 comments:

MD said...

It's interesting that you would blog about this just days after I did and I agree that this is one of my favorite keywords.

Your entry is very interesting in how it uses this with analytical functions, I wrote about it more from a basic SQL setup. Either way, it is clearly very useful in some situations.

My post is at OraDBDev WITH clause.

Byte64 said...

Hello MD,
thanks for dropping a comment.
This strange WITH clause has been on my agenda for quite some time, but for various reasons i never found the time for writing something about it earlier.
The fact is that blogging is a very consuming task, especially if you have this insane :-) habit of including sql snippets.
And speaking about insane habits, my trip to Mentana was very inspirational, in one way or another i had to share that funny experience.

Cheers!

yes you can!

Two great ways to help us out with a minimal effort. Click on the Google Plus +1 button above or...
We appreciate your support!

latest articles