Describing outer joins in metadata
Dear Metadata fans,
It’s been a while that I blogged about Pentaho Metadata. This is undeservedly so because the last couple of months, a lot of things have been moving. Most of that is not really visible to the end-user. The GUI part of the metadata suite was attacked last year and doesn’t really need all that much work. What we have been doing is extending the underlaying architecture by making it more flexible, more robust and easier to program from an API viewpoint. Most of that work has been in the capable hands of Pentaho rock star Will Gorman. The work he did last year for example was building in support for libformula (Open Office formula) by Pentaho reporting wizard Thomas Morgner.
Lately, Will and I started on the next big thing: adding support for outer joins in Pentaho Metadata models. We knew going in that it was Pandora’s box that we were opening. It’s so much easier to say: people, just write a sane star schema, but in the end it can make sense to just throw a few models over an ODS or source system, for example for prototyping and evaluation reasons.
The problem with outer joins…
The problem with outer joins is that you need to know exactly what you are doing if you want to get good reliable results out of your metadata model. That is because the order at which you execute the various outer joins determines the eventual outcome of the data.
Take a look at the Wikipedia entry on joining and outer joining to know what we’re talking about. The situation with 2 tables can be explained quite easily: Table A has 5 records, you do a left outer join with Table B but only 3 records match. Well, you still have 5 records.
However, if you start to get 3, 4 or heaven forbid even more (outer) joins in the same query, the order in which you are going to execute the joins is going to have a strong impact on the actual result of the query. (See also: this scenario) Suppose you have a Table C with 2 records. B is left outer joined with C. If you do ( A Left Outer Join B ) and then Left outer Join that result to C, you have 5 records. If you do B left outer join with C (3 records) and then right outer join with A, you get 3 records.
In the distant past I’ve had to deal with those situations a few times in closed source BI applications and I swore I would never mess with outer joins again because of this join order problem. In an ad-hoc reporting situation, this is a real issue since the users don’t care and don’t want to know about this.
How did we solve it?
Well, if the join execution order determines the result, that order in itself has to become part of the metadata, and so that’s what we did add to the model.
We think that the order in outer join situations is as important as the relationships themselves and as such, the selected join order key that is entered is also displayed in the graphical model view.
You can enter this join order key in the relationship dialog:
As you can see, we also added the 0:0 relationship type that corresponds to the “Full Outer” join type.
What does the SQL look like?
If we were to select columns from all databases, apply a condition to the second table, add a condition to an aggregate and set a sort order, we get an SQL query like this:
BT_TABLE1_TABLE1.PRIMARYKEY AS COL0
,BT_TABLE2_TABLE2.PRIMARYKEY AS COL1
,SUM(BT_TABLE1_TABLE1.PRIMARYKEY) AS COL2
,BT_TABLE4_TABLE4.PRIMARYKEY AS COL3
FROM TABLE4 BT_TABLE4_TABLE4 LEFT OUTER JOIN
TABLE1 BT_TABLE1_TABLE1 FULL OUTER JOIN TABLE2 BT_TABLE2_TABLE2
ON ( BT_TABLE1_TABLE1.PRIMARYKEY = BT_TABLE2_TABLE2.FOREIGNKEY AND ( BT_TABLE2_TABLE2.FOREIGNKEY = 2 ) )
ON ( BT_TABLE2_TABLE2.FOREIGNKEY = BT_TABLE4_TABLE4.PRIMARYKEY )
SUM(BT_TABLE1_TABLE1.PRIMARYKEY) = 2
Note that we use a nested join syntax to make sure that the requested join order is followed. What we also try to do is place the conditions as close to the join as possible for performance reasons.
What else did we change?
As you can see from the sample query above, we abandoned the aliasing of business tables with their name. Because most if not all names contain spaces, it uses quoting all over the place and it runs into some limitations if your names are too long on Oracle for example. As such, we replaced it by the business table ID. The end user never sees these queries anyway.
Another thing we modified is the way that formulas in business columns are processed. You can now include business columns from any columns in the business model in a single expression. This is obviously something that is also important when you do reporting on a 3NF model. Almost all objects you want to calculate with are going to be in a different business table. See this wiki page on Pentaho Metadata Formulas for more information.
The next things on the agenda are undoubtedly things like row level security and a special section for predefined conditions and calculated members on the business model level. There are also rumors that the long awaited Formula editor is nearing completion so we’ll be able to use that in the Pentaho Metadata Editor in the future as well. As always, you can play an important role in the evolution of Pentaho Metadata by letting us know what you love or hate about Pentaho Metadata. In the end, it is feedback from our customers and our community that drives all our software development!
Until next time,